Abstract

We study the effect that the amount of correlation in a bipartite distribution has on 
the communication complexity of a problem under that distribution. We introduce a 
new family of complexity measures that interpolates between the two previously studied 
extreme cases: the (standard) randomised communication complexity and the case of 
distributional complexity under product distributions. 

We give a tight characterisation of the randomised complexity of Disjointness under
distributions with mutual information $k$, showing that it is $\Theta(\sqrt{n(k+1)})$ for all
$0 \le k \le n$. This smoothly interpolates between the lower bounds of Babai, Frankl and Simon for
the product distribution case ($k = 0$), and the bound of Razborov for the randomised case.

The upper bounds improve and generalise what was known for product distributions, and
imply that any tight bound for Disjointness needs $\Omega(n)$ bits of mutual information in the
corresponding distribution.

We study the same question in the distributional quantum setting, and show a lower
bound of $\Omega((n(k+1))^{1/4})$, and an upper bound (via constructing communication
protocols), matching up to a logarithmic factor.

We show that there are total Boolean functions $f_d$ on $2n$ inputs that have
distributional communication complexity $O(\log n)$ under all distributions of information up to
$o(n)$, while the (interactive) distributional complexity maximised over all distributions is
$\Theta(\log d)$ for $6n \le d \le$ [...]. This shows, in particular, that the correlation needed to
show that a problem is hard can be much larger than the communication complexity of
the problem.

We show that in the setting of one-way communication under product distributions,
the dependence of communication cost on the allowed error $\epsilon$ is multiplicative in $\log(1/\epsilon)$;
the previous upper bounds had a dependence of more than $1/\epsilon$. This result explains
how one-way communication complexity under product distributions is stronger than
PAC-learning: both tasks are characterised by the VC-dimension, but have very different
error dependence (when learning from examples, it costs more to reduce the error).
1 Introduction


The standard way to attack the problem of showing a lower bound on the randomised
communication complexity of a function $f$ is to choose a probability distribution $\mu$ on the inputs,
and then show that the deterministic distributional complexity is large for $f$ w.r.t. $\mu$, i.e.,
that any deterministic protocol that computes $f$ with small error under $\mu$ must communicate
much. This approach eliminates the need to argue about the randomness used by the
protocol.^1

It is well known that this approach can be used without loss of generality, due to von
Neumann’s minimax theorem (see [20]; the same principle applies to many nonuniform
computational models):

$$\max_\mu D^\mu_\epsilon(f) = R_\epsilon(f),$$

where $D^\mu_\epsilon(f)$ denotes the deterministic complexity of protocols that compute $f$ with error $\epsilon$
under the distribution $\mu$ of inputs to $f$, and $R_\epsilon(f)$ is the public coin randomised communication
complexity of $f$ with worst-case error $\epsilon$.^2

As a matter of convenience, one first tries to use a simple distribution $\mu$, for instance
the uniform distribution, or more generally, product distributions over the inputs to Alice
and Bob. This works for some problems, like Inner Product modulo 2 [7]. However, Babai,
Frankl, and Simon [3] observed that for the Disjointness problem DISJ one cannot obtain
lower bounds larger than $\Omega(\sqrt{n}\log n)$ under any product distribution, i.e., they show that
an upper bound of $O(\sqrt{n}\log n)$ holds for every product distribution. They also give a lower
bound of $\Omega(\sqrt{n})$ under a product distribution. Later, Kalyanasundaram and Schnitger [16]
obtained the tight $\Omega(n)$ bound, and Razborov [22] showed that indeed $D^\mu_\epsilon(\mathrm{DISJ}) = \Omega(n)$
for an explicit simple distribution $\mu$, for any sufficiently small constant $\epsilon > 0$ (that such a $\mu$
exists is immediate from the result in [16] and the minimax theorem, but their proof does not
exhibit such a distribution explicitly). Distributional complexity under product distributions
has also frequently been used to show structural properties like direct product theorems
(e.g., [15], [12]). Furthermore, distributional communication complexity is the natural average
case version of communication complexity, and it makes sense to study this for distributions
that are ‘easy’, in order to get a different model than randomised complexity. It seems natural
to measure ‘easiness’ via mutual information.

For many years it was open how large the gap between $R^{I=0}_\epsilon(f) = \max_{\mu\ \mathrm{product}} D^\mu_\epsilon(f)$
and $R_\epsilon(f)$ (for constant $\epsilon > 0$) can be. Sherstov [25] finally gave a proof that there are
total Boolean functions $f$, where the former is $O(1)$ and the latter is $\Omega(n)$. In his result
$f$ is not given explicitly. Recently Alon et al. [2] have given the following optimal explicit
separation. Consider the problem where Alice gets a point and Bob a line from a projective
plane containing $2^{\Theta(n)}$ points and lines. The VC-dimension of this problem is at most 2, which
implies that the distributional complexity under any product distribution is $O(1)$ (even for
one-way protocols, [?]), whereas the sign-rank of the communication matrix is $2^{\Omega(n)}$ and
hence the randomised (even unbounded error) communication complexity is $\Omega(n)$.

This leaves open a more precise investigation of the amount of correlation in $\mu$ needed to
make $D^\mu(f)$ equal to $R(f)$. It is natural to quantify this via the mutual information $I(X:Y)$,

^1 We note that the popular information complexity method (see e.g. [?]) also uses distributional complexity,
but does not seek to eliminate randomness from protocols.

^2 Throughout the paper we do not consider private coin randomised protocols.
when the input $(X,Y)$ is drawn from $\mu$. We define the following measure:

$$R^{I \le k}_\epsilon(f) = \max_{\mu : I(X:Y) \le k} D^\mu_\epsilon(f).$$

We note here that the quantity on the right hand side does not change if randomised
or deterministic protocols are allowed, because in the distributional setting the randomness
can be fixed without increasing the error (under any distribution). The investigation of this
measure was initiated by Jain and Zhang [?] in the setting of one-way communication
complexity (we discuss their contribution at the end of Section 1.3). We note that $R^{I \le n}_\epsilon(f) =
R_\epsilon(f)$ for all functions $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$.

This family of complexity measures allows us to investigate how much correlation is 
needed in the input distribution to get good lower bounds. We have 3 main applications. 
First, we closely investigate the case of the Disjointness problem. Second, we show that a 
certain problem exhibits a threshold behaviour, i.e., only with almost maximal correlation 
can a tight lower bound be proved, and this correlation can also be larger than the actual 
communication complexity of the problem. Third, we investigate the dependence of one-way 
communication complexity under product distributions on the allowed error. 

1.1 The Disjointness problem 

In the Disjointness problem (DISJ), Alice and Bob receive, respectively, subsets $x, y \subseteq
\{1,\ldots,n\}$, and their task is to decide whether $x$ and $y$ are disjoint.^3 This is one of the
most-studied problems in communication complexity, and arguably the one with the largest number
of known applications to other models (see [?]). We give a complete characterisation of the
information-bounded distributional complexity of Disjointness for all values of $k = I(X:Y)$,
both in the randomised and in the quantum case.
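To fix the convention used throughout, here is a minimal sketch of the Disjointness predicate on $n$-bit strings, assuming the usual encoding where position $i$ is 1 exactly when element $i$ is in the set. The function name `disj` and the choice of output 1 for "disjoint" are illustrative assumptions, not notation from the paper.

```python
def disj(x: str, y: str) -> int:
    """Return 1 if the sets encoded by the bit strings x and y are disjoint, else 0.

    Position i of a string is '1' exactly when element i belongs to the set
    (assumed encoding; the paper only fixes the correspondence implicitly).
    """
    if len(x) != len(y):
        raise ValueError("both inputs must have the same length n")
    # The sets intersect iff some position carries a 1 in both strings.
    return int(all(not (a == "1" and b == "1") for a, b in zip(x, y)))
```

For example, `disj("1010", "0101")` evaluates to 1, while `disj("1010", "0011")` evaluates to 0 because both sets contain the third element.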

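Theorem 1. For all $0 \le k \le n$ and constant $\epsilon$ we have

1. $R^{I \le k}_\epsilon(\mathrm{DISJ}) = \Theta(\sqrt{n(k+1)})$.

2. $Q^{I \le k}_\epsilon(\mathrm{DISJ}) = \tilde{O}((n(k+1))^{1/4})$.

3. $Q^{I \le k}_\epsilon(\mathrm{DISJ}) = \Omega((n(k+1))^{1/4})$.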

Previously, for classical protocols, a lower bound of $\Omega(\sqrt{n})$ was known for a product
distribution [3], and the $\Omega(n)$ lower bound by Razborov [22] uses a distribution $\mu$ with
$I^\mu(X:Y) = \Theta(n)$. Babai et al. [3] also gave an upper bound of $O(\sqrt{n}\log n)$ for product
distributions, which we improve by a log-factor. The quantum case has not been considered
before.

Our results interpolate between the previously-known extreme cases, and also show that
one needs input correlation $\Omega(n)$ to prove tight lower bounds. Interestingly, the bounds
depend inverse-polynomially on the error probability, except for the extreme cases of zero
correlation and of maximal correlation. We also note that a nearly-optimal complexity for
randomised protocols can be achieved in a protocol with two rounds of communication (though
not in one round).

^3 We will often view $x$ and $y$ as binary $n$-bit strings, implicitly assuming the natural correspondence between
the elements of $\{0,1\}^n$ and $\mathrm{pow}([n])$.
The tight bound in the randomised case is based on a two-phase protocol, in which
the players first remove “uninteresting” elements from their sets, until they are (essentially) 
small enough to be communicated. For the quantum case this two-phase approach cannot be 
optimal, because the first phase reveals “too much” information about the input. Therefore 
we give a completely different protocol for the quantum case, in which the players identify 
uninteresting elements a priori. This approach is tight up to a log-factor. 

1.2 Mutual information in hard distributions 

Note that for DISJ the complexity increases with the information parameter, and the
randomised communication complexity bound $\Theta(n)$ is reached only once the information in the
hard distribution reaches $\Omega(n)$. For other problems like Inner Product mod 2 the tight bound
of $\Omega(n)$ is reached already under product distributions [7]. But can the mutual information
between the input sides that is required to show a tight lower bound ever be larger than
the actual communication complexity? I.e., is it ever necessary to use distributions that are
(much) more strongly correlated than the communication lower bound we want to show, or
is it always possible to prove a tight lower bound for a (total) function $f$ by using a hard
distribution with $I(X:Y) \le \mathrm{poly}(R(f))$? A weak example is the quantum complexity of
Disjointness, where the tight $\Theta(\sqrt{n})$ bound is only reached when the information reaches
$\Omega(n)$, but even here the complexity increases gradually with the information. We resolve this
question, although our example is not explicit.

Theorem 2. For every $n \le d \le$ [...] there is a function $f_d : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$ that
has $R(f_d) = \Theta(\log d)$, but under all bipartite distributions with mutual information less than
$n/1000$ the communication bound is $R^{I \le n/1000}(f_d) \le O(\log n)$.

Hence for $f_d$ the complexity stays low until the information is almost maximal, and then
shoots up.

1.3 Dependence of the communication cost on $\epsilon$ ^4

Finally, we investigate the error dependence of $R^{I \le k}_\epsilon(f)$ for arbitrary $f$. In the unrestricted
case, by standard boosting techniques we have $R_\epsilon(f) \le O(R_{1/3}(f) \cdot \log(1/\epsilon))$. We call a
function $f$ and a class $C$ of distributions on the inputs with
$\max_{\mu \in C} R^\mu_\epsilon(f) \le O(\max_{\mu \in C} D^\mu_{1/3}(f) \cdot \log(1/\epsilon))$
boost-able. For this definition we require the above to be true for all $\epsilon$. One can
easily show that there are distributions $\mu$ and functions $f$ such that, e.g., $D^\mu_\epsilon(f) = \Omega(n)$
for some constant $\epsilon < 1/3$ while $D^\mu_{1/3}(f) = 0$, by placing a hard problem with weight 1/3 in an
otherwise constant matrix, so for a fixed distribution $\mu$ one cannot in general expect the error
dependence to behave nicely.

Boost-ability is a property of a class of distributions. The class of all distributions clearly
has the property, but what about the class of distributions with information at most $k$? In
particular, what about $k = 0$?

The issue is particularly interesting for product distributions, because boost-ability can
be used to derive upper bounds on $R^{I \le k}_\epsilon(f)$ from upper bounds on $R^{I=0}_{1/3}(f)$: due to the

^4 The same result has been obtained recently by Molinaro et al. [?] independently. The methods
used in the two works are similar; [?] was published prior to the current publication, while our results
were presented during a public talk at BIRS prior to either publication.
substate theorem (Fact 4 below), a protocol that solves $f$ under all product distributions
with error $2^{-O((k+1)/\epsilon)}$ can be used to solve $f$ under distributions with $I(X:Y) = k$ with error $\epsilon$,
hence boost-ability would imply $R^{I \le k}_\epsilon(f) \le O((k+1) \cdot R^{I=0}_{1/3}(f)/\epsilon)$ for all $f$.

We will use the superscript “$A \to B$” to denote one-way communication. In this model
the class of product distributions is boost-able:

Theorem 3. $R^{I=0,\,A \to B}_\epsilon(f) \le O(R^{I=0,\,A \to B}_{1/3}(f) \cdot \log(1/\epsilon))$.

We also show that when the information is between 1 and [...], then neither
distributional randomised nor distributional quantum protocols are, in general, boost-able; see our
Corollaries 22 and 29.

It is well known that $R^{I=0,\,A \to B}(f) = \Theta(\mathrm{VC}(f))$ [19], where $\mathrm{VC}(f)$ is the VC-dimension
of the set of rows of the communication matrix. This even extends to the quantum case
[?]. The VC-dimension is also known to characterise the hardness of PAC-learning (see
the monograph by Kearns and Vazirani [?]). In fact, the previous proofs of the upper
bound on $R^{I=0,\,A \to B}(f)$ in terms of VC-dimension have been done by explicitly simulating
learning algorithms in the one-way communication model: random examples are generated
using a public coin, and Alice labels those examples, spending 1 bit per example, in order
to teach Bob a row of the communication matrix of $f$ in the PAC sense.

The main limitation of this approach is that for PAC learning one needs $\Omega(1/\epsilon)$ examples
to achieve error $\epsilon$. On the other hand, this approach ignores two strengths of the one-way
model: first, Alice and Bob know the underlying distribution; second, Alice can do more than
simply label examples. One can interpret the one-way communication model under product
distributions as a learning model, in which Alice is an (old-fashioned) teacher, who teaches
by monologue, but using shared randomness that does not count towards the communication.
Does such a teacher offer any advantage over learning from random examples? At first glance
no, since both models are characterised by the VC-dimension, and one could conclude that
learning from experience is all it takes. Our Theorem 3, however, shows that the final error
can be made much smaller when learning from a teacher, compared to learning “just from
experience”. Note that in practice $1/\epsilon$ can also easily become the dominating factor in the
complexity of a learning algorithm.

The main idea in our protocol is that Alice and Bob can beforehand agree on an $\epsilon$-net
among the rows of the communication matrix, and Alice simply sends the name of the nearest
row in the net. During a PAC learning algorithm, on the other hand, the $\epsilon$-net is generated
from examples, which is more costly.
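The net-based idea can be illustrated with a small sketch. It assumes the communication matrix is given explicitly as a list of 0/1 rows and that Bob's marginal `nu` is known to both players; the greedy construction and all function names are illustrative choices, not the construction analysed in the paper. The message costs about $\log_2$ of the net size in bits, and for every row the error under `nu` is at most `eps`.

```python
def row_distance(row_a, row_b, nu):
    """Probability under Bob's marginal nu that two rows of the matrix differ."""
    return sum(p for a, b, p in zip(row_a, row_b, nu) if a != b)

def greedy_eps_net(matrix, nu, eps):
    """Indices of rows such that every row is within distance eps of a chosen row."""
    net = []
    for i, row in enumerate(matrix):
        # Add a row only if it is far from everything chosen so far.
        if all(row_distance(row, matrix[j], nu) > eps for j in net):
            net.append(i)
    return net

def alice_net_message(x, matrix, nu, net):
    """Alice names (by position in the net) the net row nearest to her row x."""
    return min(range(len(net)),
               key=lambda t: row_distance(matrix[x], matrix[net[t]], nu))

def bob_output(message, y, matrix, net):
    """Bob answers with the entry of the named net row in his column y."""
    return matrix[net[message]][y]
```

On an 8 by 8 threshold matrix with the uniform marginal and `eps = 0.25`, the greedy net keeps only a few rows, so Alice's message is much shorter than her 3-bit input would suggest for larger instances of the same shape.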

We can now discuss the previous result of Jain and Zhang [?]. They show that for all
total Boolean functions $f$ in the one-way model:

$$R^{I \le k,\,A \to B}_\epsilon(f) \le O((k+1) \cdot \mathrm{VC}(f) \cdot 1/\epsilon^2 \cdot \log(1/\epsilon)).$$

This extends the VC-dimension upper bound to distributions with nonzero information. Their
protocol for information-$k$ distributions is constructed by simulating the PAC learning
algorithm for the row $x$, and by generating examples $(y', f(x,y'))$ using a rejection-sampling
protocol. We can improve the error dependence to $1/\epsilon$ by the following idea. Due to the
Substate Theorem (Fact 4 below) it is enough to find a protocol that has error $2^{-O((k+1)/\epsilon)}$ under
the product of the marginal distributions of a distribution $\mu$ (with information $k$). But this
can be achieved with communication $O((k+1)/\epsilon \cdot R^{I=0,\,A \to B}(f))$ according to Theorem 3.
2 Preliminaries and Definitions


2.1 Information Theory 

We refer to [8] for standard definitions concerning information theory. 

The relative entropy of two distributions on a discrete support is denoted by $D(\rho\|\sigma)$.
The relative max-entropy is $D_\infty(\rho\|\sigma) = \max_x \log(\rho(x)/\sigma(x))$. Note that these quantities are
infinite if the support of $\sigma$ does not contain the support of $\rho$. We mostly consider bipartite
distributions on $\{0,1\}^n \times \{0,1\}^n$. The mutual information is $I(X:Y) = D(\mu\|\mu_X \times \mu_Y)$,
where $\mu$ is the joint distribution of $(X,Y)$, and $\mu_X$ and $\mu_Y$ are the two marginal distributions
of $\mu$. We also use the quantity $I_\infty(X:Y) = D_\infty(\mu\|\mu_X \times \mu_Y)$. If we want to indicate the
distribution used we write its name as a superscript, like $I^\mu(X:Y)$. When $(X,Y,Z) \sim \mu$,
we write $\mu(Y)$ to address the marginal distribution of $Y$, and sometimes (when we feel that
a reader might benefit from such “expansion”) we write $\mu(X,Y,Z)$ to address $\mu$ itself.

We first state the following well-known fact, see [13].

Fact 4. 1. $I(X:Y) \le I_\infty(X:Y)$.

2. For a given $\mu$ there is a $\mu'$ with $\|\mu - \mu'\| \le \epsilon$, and $I^{\mu'}_\infty(X:Y) \le I^\mu(X:Y) \cdot 4/\epsilon$, where
$\|\mu - \mu'\|$ is the total variation distance between $\mu$ and $\mu'$.

We will use the following lemmas and facts. The first follows from the definition of relative 
entropy. 

Lemma 5. Let $\mu$ be a bipartite distribution, $\rho = \mu_A \times \mu_B$, and $\sigma = \sigma_A \times \sigma_B$ any product
distribution.

Then $D(\mu\|\sigma) = D(\mu\|\rho) + D(\rho\|\sigma) = I^\mu(X:Y) + D(\rho\|\sigma)$.

The following is a consequence of the log sum inequality. 

Lemma 6. Let $\mu, \sigma$ be distributions (for concreteness on $\{0,1\}^n \times \{0,1\}^n$), and $E$ an event.
Then we have that $\sum_{x,y \in E} \mu(x,y)\log(\mu(x,y)/\sigma(x,y)) \ge \max\{-1,\ \mu(E)\log(\mu(E)/\sigma(E))\}$.

Lemma 7. Let $\mu, \sigma$ be distributions on $\{0,1\}^n \times \{0,1\}^n$, $E$ an event, and $\mu'$ the distribution
$\mu$ restricted to $E$. Furthermore, assume that under $\mu$ we have that $\mathrm{Prob}(E) = \alpha$. Then
$D(\mu'\|\sigma) \le (D(\mu\|\sigma) + 1)/\alpha - \log\alpha$.

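Proof. For all $x,y \in E$ we have $\mu'(x,y) = \mu(x,y)/\alpha$, otherwise $\mu'(x,y) = 0$.

$$
\begin{aligned}
D(\mu\|\sigma) &= \sum_{x,y} \mu(x,y)\log\frac{\mu(x,y)}{\sigma(x,y)} \\
&\overset{(*)}{\ge} \sum_{x,y \in E} \mu(x,y)\log\frac{\mu(x,y)}{\sigma(x,y)} - 1 \\
&\ge \sum_{x,y \in E} \mu'(x,y)\cdot\alpha\cdot\log\frac{\mu'(x,y)\cdot\alpha}{\sigma(x,y)} - 1 \\
&= D(\mu'\|\sigma)\cdot\alpha + \alpha\log\alpha - 1,
\end{aligned}
$$

where for (*) we use Lemma 6 with the event $\{0,1\}^n \times \{0,1\}^n - E$.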
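As a sanity check of the inequality in Lemma 7, the following sketch evaluates both sides on random distributions over a small finite set. The helper names are hypothetical, distributions are plain lists of probabilities, and logarithms are base 2 (an assumption that matches measuring information in bits).

```python
import math
import random

def kl_divergence(p, q):
    """Relative entropy D(p||q) in bits; q must be positive wherever p is."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def restrict_to_event(mu, event):
    """Condition mu on the index set `event`; return (mu', alpha = mu(event))."""
    alpha = sum(mu[i] for i in event)
    mu_prime = [mu[i] / alpha if i in event else 0.0 for i in range(len(mu))]
    return mu_prime, alpha

def lemma7_gap(mu, sigma, event):
    """Right-hand side minus left-hand side of Lemma 7 (expected to be >= 0)."""
    mu_prime, alpha = restrict_to_event(mu, event)
    lhs = kl_divergence(mu_prime, sigma)
    rhs = (kl_divergence(mu, sigma) + 1) / alpha - math.log2(alpha)
    return rhs - lhs

def random_distribution(size, rng):
    """A random full-support distribution on {0, ..., size-1}."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]
```

Calling `lemma7_gap` on randomly drawn `mu`, `sigma` and events gives a non-negative value whenever the lemma applies; the check says nothing about how tight the bound is.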
We will use the following rejection sampling protocol from [10].

Fact 8. Let $\mu$ and $\nu$ be distributions on $\{0,1\}^n$ with $D(\mu\|\nu) = k$. Assume that Alice and
Bob both know $\nu$, and can create samples from $\nu$ using a public coin. Then Alice can send a
message of expected length $k + 2\log k + O(1)$ to Bob, which allows Bob (and Alice) to obtain
a shared sample from the distribution $\mu$. The expectation is over the public coin tosses, and
Bob’s sample is distributed exactly according to $\mu$.
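To illustrate the mechanics only, here is a sketch of textbook rejection sampling with a public coin. It is a deliberately simpler stand-in, not the protocol of [10]: Alice's index has expected value equal to the maximum ratio $\mu(y)/\nu(y)$, so the message length scales with the relative max-entropy rather than with $D(\mu\|\nu)$ as in Fact 8. The seed stands in for the public coin, and all names are illustrative.

```python
import random

def shared_samples(nu, seed):
    """Public coin: an endless stream of (sample from nu, uniform number in [0, 1))."""
    rng = random.Random(seed)
    support = sorted(nu)
    weights = [nu[y] for y in support]
    while True:
        yield rng.choices(support, weights=weights)[0], rng.random()

def alice_message(mu, nu, seed):
    """Index of the first shared sample that Alice accepts."""
    ratio_bound = max(mu.get(y, 0.0) / nu[y] for y in nu)
    for index, (y, u) in enumerate(shared_samples(nu, seed)):
        # Accept y with probability mu(y) / (ratio_bound * nu(y)).
        if u * ratio_bound * nu[y] < mu.get(y, 0.0):
            return index

def bob_decode(nu, seed, index):
    """Bob replays the public coin and reads off the sample Alice pointed to."""
    for i, (y, _) in enumerate(shared_samples(nu, seed)):
        if i == index:
            return y
```

Because both players replay the same stream, Bob's decoded value coincides with the sample Alice accepted, and that sample is distributed according to `mu`.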

To bound the average information during a protocol we have the following lemma. 

Lemma 9. Let $\mu$ be a distribution on $\{0,1\}^n \times \{0,1\}^n$, and $X,Y$ the corresponding random
variables. Let $\mathcal{R}$ be a partition of $\{0,1\}^n \times \{0,1\}^n$ into rectangles $R$ (where $R$ also indicates
the random variable induced under $\mu$). Then we have $I(X:Y) \ge I(X:Y|R)$.

Proof. Note that if $\mathcal{R}$ consists of only one rectangle covering everything, then we have
$I(X:Y) = I(X:Y|R)$. We show that for any $\mathcal{R}$, if we refine $\mathcal{R}$ into $\mathcal{R}'$ by splitting a single
rectangle $A \times B$ (w.l.o.g. splitting the columns into $B_1$ and $B_2$), then $I(X:Y|R)$ does not
increase. Since every $\mathcal{R}$ can be obtained by starting with a single rectangle and splitting
rectangles, this implies the lemma.^5

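Denote by $\mu_R$ the distribution $\mu$ restricted to a rectangle $R = A \times B$ and re-scaled. Denote
by $\nu_R$ the product of marginals of $\mu_R$. We have that $\nu_R(x,y) = \mu(A,y) \cdot \mu(x,B)/\mu(R)^2$, where
$\mu(A,y) = \sum_{x \in A} \mu(x,y)$ and $\mu(x,B) = \sum_{y \in B} \mu(x,y)$.

Since

$$I(X:Y|R) = \sum_R \sum_{x,y \in R} \mu(x,y)\log\frac{\mu(x,y)\,\mu(R)}{\mu(A,y)\,\mu(x,B)},$$

when splitting (a particular) $R = A \times B$ into $R_1 = A \times B_1$ and $R_2 = A \times B_2$, the expression
for $I(X:Y|R)$ is changed by adding

$$
\sum_{x,y \in R_1} \mu(x,y)\log\frac{\mu(R_1)}{\mu(x,B_1)} + \sum_{x,y \in R_2} \mu(x,y)\log\frac{\mu(R_2)}{\mu(x,B_2)}
= \sum_{x \in A} \mu(x,B_1)\log\frac{\mu(R_1)}{\mu(x,B_1)} + \sum_{x \in A} \mu(x,B_2)\log\frac{\mu(R_2)}{\mu(x,B_2)},
$$

and subtracting

$$
\sum_{x,y \in R} \mu(x,y)\log\frac{\mu(R_1) + \mu(R_2)}{\mu(x,B_1) + \mu(x,B_2)}
= \sum_{x \in A} \bigl(\mu(x,B_1) + \mu(x,B_2)\bigr)\log\frac{\mu(R_1) + \mu(R_2)}{\mu(x,B_1) + \mu(x,B_2)}.
$$

The log sum inequality implies that the latter is not smaller than the former and we are
done. ■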
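The statement of Lemma 9 can be checked numerically on a small example. The sketch below assumes the joint distribution is stored as a dictionary from cells `(x, y)` to probabilities and that a partition is a list of `(rows, columns)` pairs; both representations and the function names are illustrative assumptions.

```python
import math
import random

def mutual_information(weights):
    """I(X:Y) in bits for the normalised version of {(x, y): weight}."""
    total = sum(weights.values())
    px, py = {}, {}
    for (x, y), w in weights.items():
        px[x] = px.get(x, 0.0) + w / total
        py[y] = py.get(y, 0.0) + w / total
    return sum((w / total) * math.log2((w / total) / (px[x] * py[y]))
               for (x, y), w in weights.items() if w > 0)

def conditional_mutual_information(mu, rectangles):
    """I(X:Y|R), where R is the rectangle (rows, cols) of the partition hit by (X, Y)."""
    result = 0.0
    for rows, cols in rectangles:
        cell = {(x, y): mu[(x, y)] for x in rows for y in cols}
        mass = sum(cell.values())
        if mass > 0:
            # Weight of the rectangle times the information inside it.
            result += mass * mutual_information(cell)
    return result
```

With the trivial partition the two quantities coincide, and refining the partition by splitting rectangles never pushes the conditional value above $I(X:Y)$.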

^5 This is obvious for partitions arising from deterministic communication protocols. The same is true without
assuming a protocol. Instead of showing how to split rectangles to generate the partition we may argue that
starting from a partition we can merge rectangles until only one rectangle is left. To show this it is enough
to prove that in any partition of the matrix positions $U \times V$ into two or more rectangles there must be two
rectangles that have the same row- or column-set. The case of two rectangles is trivial, and for partitions into
$m \ge 3$ rectangles we first consider the case where a single rectangle spans all rows or all columns, in which the
remaining part can be treated via induction. If there is no such rectangle, we can take any rectangle $R = A \times B$
from the partition, and consider the three regions $S = A \times (V - B)$, $T = (U - A) \times B$, $Q = (U - A) \times (V - B)$.
It is easy to see by induction that either $S \cup Q$ or $T \cup Q$ must contain two rectangles that share the same row- or
column-sets.
The same follows for the transcript of a randomised protocol.

Lemma 10. Let $C$ be a random variable (the public coin) and $\mathcal{R}(c)$ be a partition of $\{0,1\}^n \times
\{0,1\}^n$ into rectangles $R$ (depending on a value $c$ of $C$). Let $\mu$ denote a distribution on
$\{0,1\}^n \times \{0,1\}^n$, independent of $C$. Then $I(X:Y) \ge I(X:Y|R,C)$, where $R$ is the random
variable that represents a rectangle from $\mathcal{R}(c)$ when $C = c$.

Proof. $I(X:Y|R,C) = \mathbb{E}_c\, I(X:Y|R,C=c) \le \mathbb{E}_c\, I(X:Y|C=c) = I(X:Y)$. ■

The next lemma follows from a calculation and shows that a distribution can decrease a 
joint probability compared to the product of marginal distributions only in the presence of 
mutual information. 

Lemma 11. Let $X,Y$ be Boolean random variables with a joint distribution $\mu$, and marginal
distributions $\mu_A, \mu_B$. If $\mu_A(X=1)\,\mu_B(Y=1) \ge 2\mu(X=Y=1)$, then
$I(X:Y) \ge \mu_A(X=1)\,\mu_B(Y=1)/5$.
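A quick numerical check of Lemma 11 on a grid of parameters can be sketched as follows. The parametrisation by `a = Pr[X=1]`, `b = Pr[Y=1]` and `p = Pr[X=Y=1]` and the function names are illustrative assumptions; information is measured in bits.

```python
import math

def boolean_mutual_information(a, b, p):
    """I(X:Y) in bits when Pr[X=1] = a, Pr[Y=1] = b and Pr[X=Y=1] = p."""
    joint = {(1, 1): p, (1, 0): a - p, (0, 1): b - p, (0, 0): 1 - a - b + p}
    marg_x = {1: a, 0: 1 - a}
    marg_y = {1: b, 0: 1 - b}
    return sum(q * math.log2(q / (marg_x[x] * marg_y[y]))
               for (x, y), q in joint.items() if q > 0)

def lemma11_holds(a, b, p):
    """Check I(X:Y) >= a*b/5 under the hypothesis a*b >= 2p of Lemma 11."""
    if a * b < 2 * p:
        raise ValueError("hypothesis of Lemma 11 not satisfied")
    return boolean_mutual_information(a, b, p) >= a * b / 5
```

The hypothesis forces the joint probability of the event $X = Y = 1$ to be at most half of what independence would give, and the check confirms that this deficit shows up as mutual information.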

Finally, we show that this is true for any product distribution, not just the product of 
marginals. 

Lemma 12. Let $X,Y$ be Boolean random variables with a joint distribution $\mu$ (and set
$\rho = \mu_A \times \mu_B$), and $\sigma$ any product distribution. If $\sigma_A(X=1)\,\sigma_B(Y=1) \ge 4\mu(X=Y=1)$
then $D(\mu\|\sigma) \ge \sigma(X=Y=1)/16$.

Proof. If $\rho(X=Y=1) \ge \sigma(X=Y=1)/2$, then by the above lemma $D(\mu\|\sigma) \ge D(\mu\|\rho) =
I^\mu(X:Y) \ge \rho(X=Y=1)/5 \ge \sigma(X=Y=1)/10$, because $\sigma$ is a product distribution
and the relative entropy of $\mu$ and a product distribution is minimal for $\rho$. If $\rho(X=Y=1) <
\sigma(X=Y=1)/2$, then we can bound $D(\mu\|\sigma) \ge D(\rho\|\sigma) = D(\mu_A\|\sigma_A) + D(\mu_B\|\sigma_B)$.
Assume that $\alpha = \mu_A(X=1) \le \beta/\sqrt{2}$, where $\beta = \sigma_A(X=1)$. Then
$(1-\alpha)\log((1-\alpha)/(1-\beta)) + \alpha\log(\alpha/\beta) \ge \beta/16$. Hence in this case
$D(\mu\|\sigma) \ge D(\mu_A\|\sigma_A) \ge \beta/16 = \sigma_A(X=1)/16 \ge \sigma_A(X=1)\,\sigma_B(Y=1)/16$.
Other cases follow by symmetry.


2.2 Communication Complexity 

We assume familiarity with classical and quantum communication complexity. For the former
consult [20]; the latter is surveyed in [5]. We concentrate on distributional complexity, which
we define here.

Definition 13. The distributional complexity $D^\mu_\epsilon(f)$ is the minimal worst case communication
cost of any deterministic protocol that computes $f$ with error $\epsilon$ under $\mu$. Similarly
we define $R^\mu_\epsilon(f)$ for randomised public coin protocols and $Q^\mu_\epsilon(f)$ for quantum protocols (we
consider quantum protocols with shared entanglement, but do not use the entanglement in
our protocols). When we drop the error $\epsilon$ from the notation, we set $\epsilon = 1/3$. When we drop
the superscript $\mu$ we mean the ordinary, worst-case communication complexity.

We observe that $R^\mu_\epsilon(f) = D^\mu_\epsilon(f)$ for all $f, \mu, \epsilon$, because one can fix the public coin
randomness without increasing the error. Hence, we adopt the $R$-notation, and use randomness in
upper bounds and deterministic protocols in lower bounds. Note that $Q^\mu_\epsilon(f)$ can be smaller
than $R^\mu_\epsilon(f)$, for instance for Disjointness under the hard distribution exhibited by Razborov
[22], where $R^\mu(\mathrm{DISJ}) = \Omega(n)$, since the quantum complexity of DISJ is at most $O(\sqrt{n})$ [1].

We consider functions $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$.


Definition 14. Denote by $D(k)$ the set of distributions on the inputs that have $I(X:Y) \le k$.

We define $R^{I \le k}_\epsilon(f) = \max_{\mu \in D(k)} R^\mu_\epsilon(f)$ and use an analogous definition for the quantum
case.

Clearly $R(f) = R^{I \le n}(f)$, and $R^{I=0}(f)$ is the complexity under the hardest product
distribution.

Definition 15. One-way protocols allow only a single message from Alice to Bob, who
produces the output. We indicate this model by a superscript, like $R^{I=0,\,A \to B}(f)$.

Finally, we note the following fingerprinting technique [20].

Fact 16. There is a public coin one-way protocol that checks equality of strings (of any
length) with error $1/2^k$ and communication $k$.
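One standard way to realise such a fingerprint is via random inner products modulo 2; the sketch below assumes this construction (the text does not spell one out), represents strings as lists of 0/1 integers, and uses a seed as a stand-in for the public coin. Equal strings are always accepted, and unequal strings pass each of the $k$ parity checks with probability 1/2.

```python
import random

def fingerprint(x, k, seed):
    """k parity bits <x, r_i> mod 2 for public random strings r_1, ..., r_k."""
    rng = random.Random(seed)  # stands in for the public coin
    bits = []
    for _ in range(k):
        r = [rng.randrange(2) for _ in x]
        bits.append(sum(a & b for a, b in zip(x, r)) % 2)
    return bits

def bob_accepts(y, message, seed):
    """Bob recomputes the fingerprint of his string and compares it with the message."""
    return fingerprint(y, len(message), seed) == message
```

Alice sends `fingerprint(x, k, seed)`, which is exactly $k$ bits regardless of the length of `x`.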


3 Randomised Complexity of Disjointness 

3.1 Upper Bound 

In this section we prove the upper bound for DISJ under bounded information distributions. 

First we consider the case of 0 mutual information, for which we show an upper bound of
$O(\sqrt{n}\log(1/\epsilon))$. Let $\mu$ be a product distribution on the inputs to DISJ. Babai et al. [3] already
show a protocol of cost $O(\sqrt{n}\log n\log(1/\epsilon))$ (they do not state the dependence on $\epsilon$, which is
however easy to derive from their proof). Note that one can combine their protocol for product
distributions with the Substate Theorem (Fact 4) to get a bound of $O(\sqrt{n}\,(k+1)\log n/\epsilon)$ on
the distributional complexity under distributions with information $k$: every distribution with
information $k$ approximately sits with probability $1/2^{O(k/\epsilon)}$ inside the product of its marginal
distributions, hence it is enough to use a product distribution protocol with very small error.
This bound is worse in the dependence on $k$ than what is proved below.

Theorem 17. $R^{I=0}_\epsilon(\mathrm{DISJ}) \le O(\sqrt{n} \cdot \log(1/\epsilon))$.

The proof is in the appendix. The main issue here is to achieve the small error dependence.
The protocol has a 2-phase structure, where in phase 1, assuming that Bob holds a large set
and that the probability that $x \cap y' = \emptyset$ is large, random $y'$ are drawn using the public coin
and, if disjoint from $x$, removed from the universe (initially $\{1,\ldots,n\}$). After doing this
sufficiently many times, the universe becomes small, and in phase 2 we use the small set
disjointness protocol due to Håstad and Wigderson [?].

Now we turn to distributions with more information. The protocol has the same structure,
but we need to sample from a distribution of $y'$ that is not independent of $x$, which takes
communication. The protocol also does not have the same error dependence, which we show
later to be unavoidable. Due to this we may just analyse the expected communication, and show that
the worst case communication cannot be more than $1/\epsilon$ times the established bound by appealing
to the Markov bound.

Theorem 18. $R^{I \le k}_\epsilon(\mathrm{DISJ}) \le O(\sqrt{n(k+1)}/\epsilon^2)$.

The proof is in the appendix. The main idea is to follow the 2-phase approach, and shrink
the universe until it has size $S = \sqrt{n(k+1)}$. At this point the Håstad-Wigderson small set
Disjointness protocol [?] takes over. To shrink the universe we need to sample inputs $y'$
from the distribution conditioned on $x$, and on being disjoint from $x$. This is achieved by
using the rejection sampling protocol of Fact 8. We need to bound the information increase,
but on average the first phase of the protocol removes $S$ elements from the universe using
communication $O((k+1)/\epsilon)$. There are at most $n/S$ such iterations in phase 1, hence the
expected communication is at most $O(n/S \cdot (k+1)/\epsilon)$. Another factor of $1/\epsilon$ is lost to turn
this into a worst case bound by appealing to the Markov bound.
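The accounting in the sketch above can be written out as follows. This is only a back-of-the-envelope calculation; the $O(S)$ cost for phase 2 is an assumption here (it is what a small set disjointness protocol gives for sets of size at most $S$), and constants are ignored.

```latex
\begin{align*}
\mathbb{E}[\text{cost of phase 1}]
  &\le O\!\left(\frac{n}{S}\cdot\frac{k+1}{\epsilon}\right)
   = O\!\left(\frac{\sqrt{n(k+1)}}{\epsilon}\right)
   \quad\text{for } S=\sqrt{n(k+1)},\\
\text{cost of phase 2}
  &\le O(S) = O\!\left(\sqrt{n(k+1)}\right),\\
\text{worst-case cost (Markov)}
  &\le \frac{1}{\epsilon}\cdot O\!\left(\frac{\sqrt{n(k+1)}}{\epsilon}\right)
   = O\!\left(\frac{\sqrt{n(k+1)}}{\epsilon^{2}}\right).
\end{align*}
```

The last line corresponds to aborting whenever the communication exceeds $1/\epsilon$ times its expectation, which happens with probability at most $\epsilon$.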

In the next section we will also show a lower bound of $\Omega(\sqrt{n/\epsilon})$, so the error dependence
cannot be made logarithmic, in contrast to the 0 information case.

One more issue we would like to consider is the number of rounds used. The above
protocol can easily use a large number of rounds, and it is not immediately clear whether
this is necessary. It is well known that the complexity of DISJ under product distributions
for one-way protocols is $\Theta(n)$ [19]. We have the following modification that saves most of the
interaction.

Theorem 19. 1. The complexity of DISJ under distributions with information at most $k$
for protocols with 2 rounds is at most $O(\sqrt{n(k+1)}\log n/\epsilon^2)$.

2. The complexity of DISJ under distributions with information at most $k$ for $O(\log^* n)$
rounds is at most $O(\sqrt{n(k+1)}/\epsilon^2)$.

3. In the case of 0 mutual information, the error dependence drops to a factor of $\log(1/\epsilon)$.

Proof. For the first item we observe that in phase 1 Alice can simply act as if Bob’s set was
large, and continue to let him discover $y'$s that are disjoint from $x$ until $U_i$ is guaranteed
to be small. This does not increase the bound on the communication. After this Bob can
tell Alice in which ‘round’ his set really became small, so that she can recover the proper
universe $U_j$. He also sends her his set using $\sqrt{n(k+1)}\log n$ bits. Note that in this protocol
only Alice learns the result.

For the second item we do as above, but when Bob’s set is small we also repeat the same in
reverse until both sets are small. Saglam and Tardos [23] have a protocol that solves small
set disjointness in phase 2 in $O(\log^* n)$ rounds with communication $O(\sqrt{n(k+1)}\log(1/\epsilon))$.

Finally, note that for product distributions we can use the same modifications to the
protocol described in Theorem 17. ■

3.2 Lower Bound 

In this section we prove that the protocol of the previous section is optimal (except regarding 
the exact dependence on e). 

For the lower bound we employ a distribution, depending on $n$ and $k$, such that the mutual
information of the two inputs is at most $k$; we then
prove an $\Omega(\sqrt{n(k+1)})$ lower bound for the distributional complexity under this distribution.
In what follows we consider $k = k(n)$ as being in $o(n)$, since for $k = \Omega(n)$ the upper bound
on the information is trivial and the lower bound on the communication is known.

Let $c =$ [...] and $m = c\sqrt{n(k+1)}$. Note that $m = o(n)$ as well. Now $\mu_{n,k}$ can be
defined as the distribution obtained by mixing two distributions. $\nu_{n,k}$ is uniform on pairs
of disjoint subsets of $\{1,\ldots,n\}$ of size $m$, and $\sigma_{n,k}$ is uniform on pairs of subsets of size $m$
with an intersection of size 1. Then $\mu_{n,k} = (3/4) \cdot \nu_{n,k} + (1/4) \cdot \sigma_{n,k}$. This is essentially the
distribution used in the proof by Razborov [22], but with smaller sets.
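A sampler for a distribution of this shape can be sketched as follows. The set size `m` is passed in as a parameter because the constant $c$ is not fixed here; the function name and the use of Python sets are illustrative assumptions.

```python
import random

def sample_hard_pair(n, m, rng):
    """Sample (x, y): with prob. 3/4 disjoint m-subsets of {1..n}, else intersecting in one element."""
    if 2 * m > n:
        raise ValueError("need 2m <= n so that disjoint pairs of size m exist")
    if rng.random() < 0.75:
        # Uniform pair of disjoint subsets of size m.
        elems = rng.sample(range(1, n + 1), 2 * m)
        return set(elems[:m]), set(elems[m:])
    # Uniform pair of m-subsets sharing exactly one element.
    elems = rng.sample(range(1, n + 1), 2 * m - 1)
    common = elems[0]
    return {common, *elems[1:m]}, {common, *elems[m:]}
```

Drawing a uniformly random sequence of distinct elements and cutting it into blocks gives the uniform distribution over the corresponding pairs, which is why a single call to `rng.sample` suffices in each branch.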

We show in the appendix that the information is bounded by k. 

Theorem 20. For any sufficiently small $\epsilon > 0$ we have that $D^{\mu_{n,k}}_\epsilon(\mathrm{DISJ}) = \Omega(\sqrt{n(k+1)})$,
and hence that $R^{I \le k}_\epsilon(\mathrm{DISJ}) = \Omega(\sqrt{n(k+1)})$.

The proof is similar to that of the corresponding lower bound by Razborov [22] (for
$k = \Theta(n)$). A difficulty comes from the fact that Razborov’s entropy “counting” argument
no longer works in our case, because in that argument a linear number of terms have their
entropy upper-bounded as $H(1/2) = 1$. Since we still have to deal with a linear number
of terms while having much less total entropy, we require a finer combinatorial counting
argument instead.

Now we give a simple argument that shows that error dependence cannot be logarithmic 
in 1/e. 

Theorem 21. Rl-^{DISJ) = Fl{-\Jn/e) for e > Fl{\/n). 

Proof. Above we have described a distribution $\mu_{n,k}$ with information at most $k$ such that
$\Omega(\sqrt{n(k+1)})$ communication is needed for some constant error $\delta$. We define $\tau_{n,k}$ to be
$1/(2k)\cdot\mu_{n,k} + (1 - 1/(2k))\rho$, where $\rho$ is some product distribution for DISJ that puts weight
1/2 on 1-inputs. Clearly, for error $\delta/(4k)$ the communication must be at least $\Omega(\sqrt{n(k+1)})$.
Set $k = 4\delta/\epsilon$ (note that $k < n$).

It remains to show that the information under $\tau$ is at most 1. Let $E$ be an indicator
random variable that indicates that $x,y$ have been chosen according to $\mu_{n,k}$. Then
$I(X:Y) \le I(XE:Y) = I(E:Y) + I(X:Y|E) \le H(E) + (1/2k)\cdot k \le H(1/(2k)) + 1/2 \le 1$.
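
The last step of the chain, $H(1/(2k)) + 1/2 \le 1$, can be checked numerically. The sketch below is only an illustration with hypothetical helper names (`binary_entropy`, `info_bound`); it suggests that the step needs $k$ to be at least a small constant, which is the regime of interest here since $k$ grows as $\epsilon$ shrinks.

```python
from math import log2

def binary_entropy(p):
    """Binary entropy H(p) in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def info_bound(k):
    """The quantity H(1/(2k)) + 1/2 bounding I(X:Y) under the mixed distribution."""
    return binary_entropy(1 / (2 * k)) + 0.5

# Smallest integer k for which the bound is at most 1 bit.
smallest_k = next(k for k in range(1, 100) if info_bound(k) <= 1.0)
```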


Corollary 22. The class of distributions with information k with 1 < k < is not 

boost-able for randomised protocols. 

Proof. Consider $k = 1$. For constant error we have that $R^{I\le 1}(DISJ) \le O(\sqrt{n})$. If distributions with at most
1 bit of information were boost-able, then we would have $R^{I\le 1}_\epsilon(DISJ) \le O(\sqrt{n}\log(1/\epsilon))$. But
the left hand side is at least $\Omega(\sqrt{n/\epsilon})$, which puts a lower bound on $\epsilon$, whereas boost-ability
should work for all $\epsilon$.

In the case of larger $k$ we use the same proof, to get that $\sqrt{\epsilon}\cdot\log(1/\epsilon) \ge \Omega(1/\sqrt{k+1})$,
which remains a restriction on $\epsilon$ until $k$ exceeds and the assumption of Theorem 21
is violated. ■
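
To see numerically why a lower bound growing like $\sqrt{1/\epsilon}$ is incompatible with an upper bound growing like $\log(1/\epsilon)$, one can look at their ratio for $k = 1$ with all constants dropped. This is a small illustrative sketch; `boosting_ratio` is a hypothetical name and the constants hidden in the $O$ and $\Omega$ are ignored.

```python
from math import log2, sqrt

def boosting_ratio(eps):
    """sqrt(1/eps) / log2(1/eps): growth of the lower bound relative to the
    boosted upper bound, constants ignored (case k = 1)."""
    return sqrt(1.0 / eps) / log2(1.0 / eps)

# The ratio is unbounded as eps -> 0, so the two bounds cannot both hold for all eps.
ratios = [boosting_ratio(2.0 ** -t) for t in (4, 10, 20, 40)]
```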

4 Quantum Complexity of Disjointness 

4.1 Upper Bound: First Attempt 

Consider the two-phase approach from the previous section. The second phase ‘quantises’
readily, if we do not care about log-factors: simply use distributed quantum search by
amplitude amplification to obtain a quadratic speedup in this part [6]. We mention here that
the tight protocol for DISJ due to Aaronson and Ambainis [1] does not seem to work well for
the small set case, and so we do not know if the logarithmic factor is needed or not.
The problem is the first phase of the classical protocol, which seems impossible to quantise.
Since phase 2 is now cheaper one can re-balance the costs of the two phases (details are left 
to the reader) and find a protocol with cost 0{{n{k + 1))^^^- 

In the next section we will show that this bound is not optimal. We do note here, however,
that the error dependence for the case $I(X:Y) = 0$ is a factor of $O(-\log\epsilon)$ for the above,
which will not be the case in the protocol we present next.


4.2 Upper Bound: Almost Optimal Protocol 

We now describe a different approach that also works in the classical case, but loses a
logarithmic factor and has worse error dependence for product distributions. The approach we use
identifies two moderately-sized blocks of “interesting” positions (i.e., the blocks are subsets
of $[n]$), such that Alice can conduct a search efficiently on one block, and Bob on the other
one. The situation when the input sets intersect, but not on any interesting position, will be
“unlikely”. This conforms to the intuition that if “large” $x$ and $y$ come from a product
distribution that puts constant weight on 1-inputs, then there must be many “semi-interesting”
or “uninteresting” positions - i.e., positions $i \in [n]$ for which it is unlikely that both $i \in x$ and $i \in y$.

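Let $\mu$ be a distribution on $\{0,1\}^n \times \{0,1\}^n$ with $I_\mu(X:Y) \le k$. Denote by $E_i$ the
event that $\sum_{j<i} X_jY_j = 0$ - i.e., $X$ and $Y$ are disjoint on $\{1,\dots,i-1\}$. Let $\mu_i$ be
$\mu$, conditioned on $E_i$. Let $q_i^x = \mathrm{Prob}[Y_i = 1 \mid X = x, E_i] = \mathrm{Prob}_{\mu_i}[Y_i = 1 \mid X = x]$ and
$p_i^y = \mathrm{Prob}[X_i = 1 \mid Y = y, E_i] = \mathrm{Prob}_{\mu_i}[X_i = 1 \mid Y = y]$.⁸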

The Protocol. 

Let $\tau = \epsilon^3/\sqrt{(k+1)n}$, $C_A = X \cap \{i \mid q_i^x \ge \tau\}$ and $C_B = Y \cap \{i \mid p_i^y \ge \tau\}$. Alice locally computes
$C_A$ and, using Bob as an oracle, applies Grover’s search to check whether $C_A \cap Y = \emptyset$
with error at most $\epsilon$ - unless $|C_A| > \ln(1/\epsilon)/\tau$, in which case the protocol halts and declares
“$X \cap Y \ne \emptyset$”. Bob does the same for $C_B$. If an intersection has been found, the protocol
declares “$X \cap Y \ne \emptyset$”. Otherwise, it is declared that “$X \cap Y = \emptyset$”.

Intuitively, the protocol requires that each player searches among those positions, where
he has ‘1’ and the opponent is likely to have ‘1’ as well.
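
For intuition, Alice's set of "interesting" positions can be computed by brute force on toy instances. The sketch below assumes a joint distribution given explicitly as a Python dict from input pairs to probabilities, uses 0-indexed positions, and introduces hypothetical helper names (`disjoint_on_prefix`, `q_cond`, `chosen_set_alice`); it covers only the classical, local part of the protocol and not the Grover search.

```python
from itertools import product

def disjoint_on_prefix(x, y, i):
    """Event E_i: x and y do not intersect on positions 0..i-1."""
    return all(x[j] * y[j] == 0 for j in range(i))

def q_cond(mu, x, i):
    """Prob[Y_i = 1 | X = x, E_i] under the joint distribution mu."""
    den = sum(p for (a, b), p in mu.items()
              if a == x and disjoint_on_prefix(a, b, i))
    if den == 0:
        return 0.0
    num = sum(p for (a, b), p in mu.items()
              if a == x and b[i] == 1 and disjoint_on_prefix(a, b, i))
    return num / den

def chosen_set_alice(mu, x, tau):
    """Positions where x has a 1 and Bob is likely (>= tau) to have a 1 as well."""
    return {i for i in range(len(x)) if x[i] == 1 and q_cond(mu, x, i) >= tau}

# Toy example: the uniform (product) distribution on pairs of 2-bit strings.
strings = list(product((0, 1), repeat=2))
uniform = {(a, b): 1 / 16 for a in strings for b in strings}
```

Bob's set is obtained symmetrically by exchanging the roles of the two inputs.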

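The communication cost of the protocol is

$$O\left(\sqrt{\frac{\ln(1/\epsilon)}{\tau}}\cdot\log n\cdot\log\frac{1}{\epsilon}\right) = O\left(\sqrt[4]{(k+1)n}\cdot\log n\cdot\left(\frac{\log(1/\epsilon)}{\epsilon}\right)^{3/2}\right).$$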


Error Analysis. 

The protocol can make a mistake in one of the following 3 cases: (a) when $(C_A \cap Y) \cup
(C_B \cap X) \ne \emptyset$ but this fact has not been detected⁹; (b) when $|C_A| > \ln(1/\epsilon)/\tau$ or $|C_B| > \ln(1/\epsilon)/\tau$ but
$X \cap Y = \emptyset$; (c) when $(C_A \cap Y) \cup (C_B \cap X) = \emptyset$ but $X \cap Y \ne \emptyset$. In the first case, the error can
only happen if Grover’s search fails - with probability at most $\epsilon$ for each player (at most $2\epsilon$
in total); in the second case, the intersection was empty in spite of the fact that, conditioned
on “$X = x$”, the probability of this has been at most $(1-\tau)^{|C_A|} < \epsilon$, or similarly for “$Y = y$”
and $|C_B|$ (or both) - this adds less than $2\epsilon$ to the error. So, the combined probability of the
cases (a) and (b) is less than $4\epsilon$.

⁸To the end of Section 4.2 all the probabilities and the expectations are taken w.r.t. $\mu$, unless stated
otherwise.

⁹Note that $(C_A \cap Y) \cup (C_B \cap X) = \{i \in C_A \cup C_B \mid X_i = Y_i = 1\}$.
We will show that (c) occurs with probability $O(\epsilon)$. For that we will introduce a
classification of the positions $i \in [n]$, and we need a few more definitions to describe our classes. By
$\tilde x, \tilde y$ we denote prefixes of strings $x, y$ of length $i-1$, where $i$ is usually clear from the context.
The random variable $\tilde X$ is the prefix of length $i-1$ of the random variable $X$ (Alice’s inputs),
and similarly for $\tilde Y$.

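Denote $p_i^{\tilde x} = \mathrm{Prob}[X_i = 1 \mid E_i, \tilde X = \tilde x]$, $q_i^{\tilde x} = \mathrm{Prob}[Y_i = 1 \mid E_i, \tilde X = \tilde x]$, and similarly for
$p_i^{\tilde y}$ and $q_i^{\tilde y}$. Denote also $p_i'^{\tilde y} = \mathrm{Prob}[X_i = 1 \mid E_i, Y_i = 1, \tilde Y = \tilde y]$, and similarly for $p_i'^{\tilde x}$, $q_i'^{\tilde x}$ and $q_i'^{\tilde y}$.
Denote by $r_i^{\tilde x} = p_i^{\tilde x} q_i'^{\tilde x} = p_i'^{\tilde x} q_i^{\tilde x}$ the probability that $X_i = Y_i = 1$ under the conditions $\tilde X = \tilde x$
and $E_i$, and similarly for other conditions (i.e., the super-script specifies the condition): say,
$r_i$ is the probability that $X_i = Y_i = 1$ conditioned on $E_i$, and so on. Let $s_i = \mathrm{Prob}[E_i]$, and
like above, use super-scripts to specify conditions ($s_i^x = \mathrm{Prob}[E_i \mid X = x]$, and so on).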

We say that a position $i \in [n]$ is bad for $x$ if $x_i = 1$ and $q_i^x < \epsilon q_i'^{\tilde x}$. Informally, these are
the positions where $x$ “depresses” the probability of intersection compared to $\tilde x$. Similarly
define $i$'s that are bad for $y$ ($y_i = 1$ and $p_i^y < \epsilon p_i'^{\tilde y}$), and call $i$ bad if it is bad for $x$ or for $y$.

Say that $i$ is lucky for $(\tilde x, \tilde y)$ if $p_i^{\tilde x} \le (k+1)p_i'^{\tilde y}/\epsilon^3$. Informally, these are the positions
where the probability of the event “$X_i = 1$”, conditioned on $\tilde x$ is not much higher than
the probability of the same event, conditioned on $\tilde y$ and (more importantly) on the event
“$Y_i = 1$”.

Finally, we call every $i \in C_A \cup C_B$ chosen. Recall that we are dealing with case (c) -
that is, we are analysing the probability that $X$ intersects $Y$ over the non-chosen coordinates.
There are 3 cases to consider: when the first intersection is on a bad position, when it is on
a non-chosen, non-bad, lucky position, and when it is on a non-chosen, non-lucky position.

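Denote by $V_i$ the event that $i$ is bad for $x$, and by $W_i$ the event that $i$ is bad for $y$. Set
$\bar q_i^{\tilde x} = \mathrm{Prob}[Y_i = 1 \wedge V_i \mid X_i = 1, \tilde X = \tilde x, E_i]$ and $\bar p_i^{\tilde y} = \mathrm{Prob}[X_i = 1 \wedge W_i \mid Y_i = 1, \tilde Y = \tilde y, E_i]$.

For every $i \in [n]$, the probability that it is bad for $x$ and the first intersection is on $i$ is
$p_i^{\tilde x}\bar q_i^{\tilde x}s_i^{\tilde x}$. Similarly, the probability that it is bad for $y$ and the first intersection is on it is
$\bar p_i^{\tilde y}q_i^{\tilde y}s_i^{\tilde y}$.

The following lemma shows that the first intersection cannot be on a bad position often.

Lemma 23. $\bar q_i^{\tilde x} \le \epsilon q_i'^{\tilde x}$.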

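Proof. Denote by $V_i(x)$ the property that $i$ is bad for $x$ and by $E_i(x,y)$ the property that
$x, y$ are disjoint on $\{1,\dots,i-1\}$. Write $P = \mathrm{Prob}[X_i = 1 \wedge \tilde X = \tilde x \wedge E_i]$. Then

$$\begin{aligned}
\bar q_i^{\tilde x} &= \frac{\mathrm{Prob}[V_i \wedge Y_i = 1 \wedge X_i = 1 \wedge \tilde X = \tilde x \wedge E_i]}{\mathrm{Prob}[X_i = 1 \wedge \tilde X = \tilde x \wedge E_i]} \\
&= \sum_{x:\, x_i = 1,\ (x_1,\dots,x_{i-1}) = \tilde x,\ V_i(x)}\ \ \sum_{y:\, y_i = 1,\ E_i(x,y)} \mu(x,y) \Big/ P \\
&= \sum_{x:\, x_i = 1,\ (x_1,\dots,x_{i-1}) = \tilde x,\ V_i(x)} q_i^x \cdot \mathrm{Prob}[X = x \wedge E_i] \Big/ P \\
&\le \epsilon q_i'^{\tilde x} \cdot \sum_{x:\, x_i = 1,\ (x_1,\dots,x_{i-1}) = \tilde x} \mathrm{Prob}[X = x \wedge E_i] \Big/ P \;=\; \epsilon q_i'^{\tilde x}.
\end{aligned}$$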
Therefore,


i:Vi{x) 




i < e • E; 


z-.Vi(x) 




< e • E, 


Epf-^ 


"sf < e, 


and similarly for $\bar p_i^{\tilde y}q_i^{\tilde y}s_i^{\tilde y}$; so, the first intersection is on a bad position (either chosen or
not) with probability at most $2\epsilon$.

Fix prefixes $(\tilde x, \tilde y)$. Assume $i$ is lucky for $(\tilde x, \tilde y)$. Now consider some input $(x, y)$ consistent
with $(\tilde x, \tilde y)$, such that $i$ has not been chosen on $(x,y)$ and that $i$ is not bad for $(x,y)$. If no
such $(x, y)$ exists, then all non-chosen $i$ on all $(x, y)$ consistent with $(\tilde x, \tilde y)$ are bad, and the
error on them has been accounted for above. Hence we assume we can find $(x, y)$ such that
$i$ is not bad and not chosen.

We get

$$q_i'^{\tilde x} \le \frac{q_i^x}{\epsilon} < \frac{\epsilon^2}{\sqrt{(k+1)n}},$$

where the first inequality holds because $i$ is not bad, and the second one holds because $i$ is
not chosen.

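We still consider a lucky $i$ for $(\tilde x, \tilde y)$. The probability, conditioned on $\tilde x$, that the first
intersection is at the position $i$ satisfies

$$r_i^{\tilde x}s_i^{\tilde x} \le r_i^{\tilde x} = p_i^{\tilde x}q_i'^{\tilde x} \le \frac{(k+1)\cdot p_i'^{\tilde y}q_i'^{\tilde x}}{\epsilon^3} \le \frac{\epsilon}{n},$$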
where the second inequality holds because i is lucky. Therefore, 


$$\sum_{i,\tilde x,\tilde y:\ \text{lucky}} \mathrm{Prob}[\tilde x,\tilde y]\, r_i^{\tilde x}s_i^{\tilde x} \le \epsilon,$$


where we do not count error contributed by $i$'s that are bad on inputs $x, y$, but the first
intersection of $x, y$. So a first intersection occurs on a non-chosen lucky position with probability
at most $\epsilon$ (again, ignoring error from bad $i, x, y$).

It remains to consider the case of non-chosen non-lucky positions. The difference from
the previous (lucky) case is that now $p_i^{\tilde x} > (k+1)p_i'^{\tilde y}/\epsilon^3$ - that is, being conditioned on $\tilde x$, the
event “$X_i = 1$” is much more likely than conditioned on $\tilde y$. We will see that in this case an
upper bound on the probability of “$X_i = Y_i = 1$” follows, essentially, from the assumption
that $I_\mu(X:Y) \le k$ - the core assumption used in the proof of the following.

Lemma 24. Assume that for no $x$ or $y$ the conditional probability of non-intersection is less
than $\alpha$ and that for no $x$, $y$ and $i$ the probability that $X_i = Y_i = 1$ is larger than 1/2 when
conditioned on $E_i$, $E_i$ and $X = x$, or $E_i$ and $Y = y$ (i.e., $s_i \ge \alpha$ and $r_i^x, r_i^y, r_i \le 1/2$ always).
Then

^ pfg/ < 16A:/a -F 68/a^. 
i 

We will apply the lemma with $\alpha = \epsilon$. Before we continue, let us see why we can assume
that one-sided conditional probability of non-intersection is never less than $\epsilon$ and that $r_i^x$,
$r_i^y$ and $r_i$ are never greater than 1/2 for our target distribution. If the original $\mu$ is not like
that, we replace it with $\mu'$, obtained via sampling $(X,Y) \sim \mu$ and replacing $X$ by $0^n$ with
probability 1/2 and - independently - replacing $Y$ by $0^n$ with probability 1/2. Note that $\mu'$
satisfies the lemma requirement (as long as $\epsilon \le 1/2$), $I_{\mu'}(X:Y) \le I_\mu(X:Y)$ and any
protocol solving DISJ over $\mu'$ with error at most $\delta$ solves it over $\mu$ with error at most $4\delta$.

The probability that the first intersection is at a non-lucky non-chosen (and not necessarily 
non-bad) position satisfies 






< 


Ee: 


f-^i 

x,y 


k + 1 


< 


(16/c/e -|- 68/e^ 
k + 1 


< 84e, 


where the first inequality holds because $i$ is non-lucky and the second is Lemma 24 with
$\alpha = \epsilon$.

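Taking into account all the steps of our analysis (including possible replacing of $\mu$ by $\mu'$),
the error of our protocol is $O(\epsilon)$. Via appropriate re-scaling of $\epsilon$, we get the required result:

Theorem 25. $Q^{I\le k}_\epsilon(DISJ) \le O\left(\sqrt[4]{(k+1)n}\cdot\log n\cdot\left(\frac{\log(1/\epsilon)}{\epsilon}\right)^{3/2}\right)$.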


It remains to prove the lemma. 

Proof of Lemma 24. The following argument extensively uses the assumption that

$$k \ge I_\mu(X:Y) = D(\mu\|\sigma),$$

where $\sigma$ is the product of the marginals of $\mu$. We need some new definitions: For all $i,j \in [n]$,
denote by ai the product of the marginals of pi, and by the distribution pi, conditioned 
on the event Xi = xi,...,Xj = Xj,Yi = which we abbreviate by 

Similarly, is Uj conditioned on . Note that for the latter probability distribution 

we first take the product of marginals of /ij, and then condition. This is different from 
considering conditional mutual information, in which one would hrst condition and then take 
the product of marginals. We also stress that here j denotes the length of x, y, unlike before. 
In the following, when we do not mention j explicitly, it is z — 1: e.g., Let 

kfy D{pfy\x„ Yi)) 


and ki 

‘ x,y i 




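Observe that $p_i^{\tilde x}q_i^{\tilde y}$ is the probability that $X_i = Y_i = 1$ under the distribution $\sigma_i^{\tilde x,\tilde y}(X_i,Y_i)$.
As $\sigma_i$ is a product distribution, conditioning on $\tilde Y$ does not change the probability of $X_i = 1$,
and so

$$p_i^{\tilde x} = \mathrm{Prob}_{\sigma_i}[X_i = 1 \mid \tilde X = \tilde x, \tilde Y = \tilde y].$$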


We can now use LemmafT^to conclude that either of of < 4r/’^or i4(///’^(Vi, V)! |cj/’^(Vj, 1/)) > 
pfqf/16. Hence, 


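$$p_i^{\tilde x}q_i^{\tilde y} \le 4r_i^{\tilde x,\tilde y} + 16D\left(\mu_i^{\tilde x,\tilde y}(X_i,Y_i)\,\|\,\sigma_i^{\tilde x,\tilde y}(X_i,Y_i)\right) = 4r_i^{\tilde x,\tilde y} + 16k_i^{\tilde x,\tilde y}$$

and

$$\sum_i \mathbf{E}_{\tilde x,\tilde y}\left[p_i^{\tilde x}q_i^{\tilde y}\right] \le \sum_i \mathbf{E}_{\tilde x,\tilde y}\left[4r_i^{\tilde x,\tilde y} + 16k_i^{\tilde x,\tilde y}\right] = 4\sum_i \mathbf{E}_{\tilde x,\tilde y}\left[r_i^{\tilde x,\tilde y}\right] + 16\sum_i k_i.$$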
Note that

$$\sum_i r_i s_i \le 1,$$

and since we are assuming that $s_i \ge \alpha$ always,

$$\sum_i \mathbf{E}_{\tilde x,\tilde y}\left[p_i^{\tilde x}q_i^{\tilde y}\right] \le 4/\alpha + 16\sum_i k_i \le 4/\alpha^2 + 16\sum_i k_i. \qquad (1)$$


It remains to bound $\sum_i k_i$ by $k/\alpha + 4/\alpha^2$. Note that for $(x,y) \in E_{i+1}$ it holds that
$\mu_{i+1}(x,y) = \mu_i(x,y)/(1 - r_i)$, and so,

E E ^ 

y':(x,y')eEi+i * a:':(x',y)eEi+i * 

/ii(x,^0 • (1 -rf) ^i(x',y) • (1-rf) 

/ —1 _ 7^ . / ^ 


• 

y':{x,y')&Ei * x':{x',y)&Ei 

^^i{x) ■ f^iiy) ■ {I - rf){l - r^) 


1 - r,: 




(1 - ri)2 

(1 -rf)(l - rf) 


(1 - n) 


.'is 


Then 


D{yi+i\Wi+i) 

-- k'i+i{x,y)'^og 

{x,y)&Ei+i 


yi+i{x,y) 

(^i+i{x,y) 


k-ijx^y) _ yi{x, y)-{l- n) 


< 


{x,y)^Ei+i 

yiix,y) 


E T 


1-ri ai{x, y){l - rf )(1 - rf) 
yi{x,y) • (1 - ri) 


log 


-n cri(a:,y)(l - rf)(l - rf) 


{x,y)GEi 

yi{Ei\Ei+i) log{Ayi{Ei\Ei+i) 


< 


E T 


1-ri fJi(Ei\Ei+i) 

, „yiix,y) 


{x,y)GEi 


- Vi 


log 


(Xi{x,y) 


+ E 

(x,y)eEi * 


1 - 7’i 


_ rj • log(4ri) 

(l-rf)(l-rf) 1-ri 


+ 2 yi{x,y) ■2-{rf+ rf)-2rilog{Ari) 

+ 12r* - 2ri log(r,), 

I-Xi 


( 2 ) 


(3) 

(4) 


where in (2) we used Lemma 6, in (3) substituted $r_i = \mu_i(E_i \setminus E_{i+1})$, in (4) used the fact
that $-\log(1 - x) \le 2x$ for $0 \le x \le 1/2$, and repeatedly applied $r_i^x, r_i^y, r_i \le 1/2$. The above
inequality tells us that the corresponding relative entropy increases only slightly between the
positions $i$ and $i + 1$.

Next we fix Xi = xi,...,Xi = Xi and Yi = = in, and look at the term 

comparing it to Recall that the involved distributions 

and are on Xj+i,..., 1 ^+ 1 ,..., and they are determined by the values x = 

xi,...,Xi and y = yi,... ,yi. Below we assume that Xjyj / 1 for all j < i (otherwise the 
input is not in the support of //j+i and the following upper bound on holds 

trivially). 


r^/ x,yA\\ x,yA\ 

= ^ Mi+i(x,ylx,y)log 

x,y£Ei+i:(xi,...,Xi)=x,(yi,...,yi)=y 

< ^ Mx,ylx,y)log 

x,y€Ei:(xi,...,Xi)=x,(yi,...,yi)=y 


yi+i(x,ylx,y) 
cri+i(x,ylx,y) 

/Ji(x,ylx,y) 


(7i(x,ylx,y) ■ (1 -rf)(l -rf) 


<D(/rp| 

\<dY + ] 


x,y:(xi,...,xi) 

<Diyfy'^\ 

\<dY + ] 


x,y:{xi,...,xi) 

=D{yfy'^\ 

|ap) + 2rf + 2rf. 


/j.i(x,ylx,y) ■ log 


1 


,(l-rf)(l-rf) 

f 

IJ.i{x,y\x,y){2rf + 2rf) 


(5) 

( 6 ) 


where in (5) and (6) the condition $E_{i+1}$ is satisfied by all the considered inputs (this is
implied by the values $(\tilde x, \tilde y)$), and so, no “re-scaling” happens while going from $\mu_{i+1}$ to $\mu_i$. In
particular,


x,y 




cr. 


x,y,i^ 


< 


i+1 ) — 


x,y 




x,y,i\\x,y,i 


O', 


*) + 4ri. 


(7) 


We are ready to bound $\sum_i k_i$. Note that $\mu_1 = \mu$; let us use the “chain
rule” for relative entropy:


k >D{y\\a) 

=D{yi{Xi,Y,)\\ai{Xi,Y,)) 

+ E^,ly^D{y^,^’y^{X2,...,Xn,Y2,...,Yr,)\\a^,^’y^X2,...,Xn,Y2,...,Yn)) 

>D(;ri(W,Fi)||cTi(Xi,yi)) • (1 - n) - 

=D{y,iXi,Y,)\\aiXi,Y,)) 

+ E^ly^D{y^,^’y^{X2,Y2)\\a^^’yHX2,Y2))-il-n) 

+ ^^xlx,,y,,y,D{y™^’y^iX3^ • • •)11^2(^3, ■ ■ ■)) ' (1 ” h) " 4ri 
>I)(//i(Xi,Fi)||ai(Xi,Fi))+E^J,^^Z)(/r^i’^^(X2,F2)lk^^’"^(X2,F2))-(l-ri) (9) 

+ • (1 - n) - 4ri - 4r2 • (1 - n) 

>I)(/ii(Xi,Fi)||ai(Xi,Fi))+E^J^^^D(/r^i’"HX2,y2)lk2^’"^(X2,y2)) • (1 - ri) 

+ • (1 - ^i)(i - ^ 2 ) - 4n - 4r2 • (1 - n) 
^ E y,)l kf • a - 4 r,

i i 

= ki ■ a — A/a, 
i 


where (8) and (9) followed from (7) and in the last inequality we used the assumption that
$\prod_{i=1}^{n}(1 - r_i) \ge \alpha$ and


<11(1 + "*)^ n 137- 



This means that $\sum_i k_i \le k/\alpha + 4/\alpha^2$, as required.

■ (Lemma 24)


4.3 Lower Bound 

We use exactly the same hard distribution for the quantum case as for the classical case, see
Section 3.2, where also the mutual information of this distribution is shown to be at most $k$.
Conveniently, Razborov [23] has done most of the hard work for us by analysing the quantum
complexity of Disjointness for all set sizes. We get the following main result:

Theorem 26. The distributional quantum communication complexity of Disjointness under
$\mu_{n,k}$ is at least $\Omega((n(k+1))^{1/4})$.

Proof. Recall the distributions $\nu_{n,k}$, $\sigma_{n,k}$ as defined in Section 3.2. These are the distributions
of sets of size $s = O(\sqrt{n(k+1)})$ from a size $n$ universe (not intersecting resp. intersecting).
We employ the following result by Razborov [23]:

Fact 27. Any quantum protocol that solves DISJ with error $\epsilon$ under $\nu_{n,k}$ and error $\epsilon$ under
$\sigma_{n,k}$ needs communication $\Omega(\sqrt{s}) = \Omega((n(k+1))^{1/4})$.

This follows from Razborov’s proof, in which given a quantum protocol with
communication $c$ for DISJ (on inputs of size $s$ from a size $n$ universe), a uni-variate polynomial $p$ of
degree $O(c)$ on $\{0,1,\dots,s\}$ is constructed such that $p(i)$ is close to 0 for all $i \in \{0,1,\dots,s-1\}$
and $p(s) = 1$. Such a polynomial must have degree $\Omega(\sqrt{s})$. The construction is done by
averaging of the acceptance probabilities on all inputs $x,y$ where $x,y$ have size $s$, and hence
it is enough if the given protocol for DISJ is correct on average inputs under $\nu_{n,k}$ and under
$\sigma_{n,k}$. But any protocol with small error under $\mu_{n,k}$ must also have small error under both of
these distributions, and we get the same lower bound under this distribution as in the worst
case, as stated by Razborov. ■

We also note that again, the error dependence cannot be poly-logarithmic. The proof is
the same as in the classical case.

Theorem 28. $Q^{I\le 1}_\epsilon(DISJ) \ge \Omega((n/\epsilon)^{1/4})$ for $\epsilon \ge \Omega(1/n)$.

We again obtain the following.

Corollary 29. The class of distributions with information k with 1 < k < is not
boost-able for quantum protocols.
5 Large Correlation is Needed for Tight Bounds


In this section we show that there is a function, for which the distributional communication
complexity is far from the randomised communication complexity if the information in the
distribution is less than $\Omega(n)$. The main idea is that random sparse problems make it hard
for low information distributions to ‘focus’ on the 1-inputs.

Define $f_{n,d}$ as a random variable that takes as its values functions $f : \{0,1\}^n \times \{0,1\}^n \to
\{0,1\}$. The functions are generated randomly as follows. Each input $x,y$ is chosen to be a
1-input independently with probability $d/2^n$.

Note that the communication matrix of fn,d has expected d 1-inputs for each row and 
column. In the following d should be thought of as some value like 2'^. We need > 

d > 6n. 

We first show that the complexity of $f_{n,d}$ is $\Theta(\log d)$ with high probability. Then, we show
that with high probability $f_{n,d}$ has a property that allows an $O(\log n)$ protocol under all low
information distributions.

First we note that by the Chernoff bound the probability that a row or column has more 
than 2d or less than d/2 1-inputs is at most 2e~^^^ < 2“^”. By the union bound it is true for 
all rows and columns (with high probability) that they contain between d/2 and 2d 1-inputs.
Throughout this section we assume that fn,d has this property. 
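
The random function can be generated explicitly for very small $n$, which makes the row and column concentration easy to inspect empirically. This is a minimal sketch with hypothetical helper names (`sample_f`, `row_and_column_counts`); it materialises the full $2^n \times 2^n$ matrix and is therefore only usable for toy sizes.

```python
import random

def sample_f(n, d, rng):
    """Sample a 2^n x 2^n 0/1 matrix; each entry is 1 independently with prob. d / 2^n."""
    size = 2 ** n
    p = d / size
    return [[1 if rng.random() < p else 0 for _ in range(size)]
            for _ in range(size)]

def row_and_column_counts(matrix):
    """Number of 1-inputs in every row and in every column."""
    rows = [sum(r) for r in matrix]
    cols = [sum(c) for c in zip(*matrix)]
    return rows, cols

# Example: inspect how far the counts deviate from the expectation d.
matrix = sample_f(6, 8, random.Random(1))
rows, cols = row_and_column_counts(matrix)
```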

Lemma 30. $R(f_{n,d}) \le O(\log d)$ with high probability.

Proof. With high probability there are at most $2d$ 1-inputs $(x_1,y), \dots, (x_{2d},y)$ in Bob’s
column. If Alice sends a fingerprint of $x$ as in Fact 10, using $2\log d$ bits, then Bob can check
whether $x = x_j$ for some $1 \le j \le 2d$ with error $2d \cdot d^{-2} \le 2/d$. If so, then he accepts,
otherwise he rejects. ■
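
The protocol in this proof can be sketched as follows. The fingerprint used here (parities of the input against shared random masks) is one standard public-coin choice and is an assumption of this sketch, not necessarily the construction of the fact cited above; the names `fingerprint` and `bob_accepts` are hypothetical. Inputs are encoded as Python integers.

```python
import random

def fingerprint(x, masks):
    """Parity of (x AND r) for each shared random mask r."""
    return tuple(bin(x & r).count("1") % 2 for r in masks)

def bob_accepts(x, candidates, n, bits, rng):
    """Alice fingerprints her n-bit input x with `bits` public-coin masks;
    Bob accepts iff some candidate row from his column has the same fingerprint.
    A candidate different from x collides with probability 2^-bits."""
    masks = [rng.getrandbits(n) for _ in range(bits)]   # shared randomness
    fx = fingerprint(x, masks)                          # Alice's message
    return any(fingerprint(c, masks) == fx for c in candidates)
```

The error is one-sided: if Alice's input is among Bob's candidates the fingerprints always match, and a false accept requires a collision with one of the at most $2d$ candidates.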

Lemma 31. $R(f_{n,d}) \ge \Omega(\log d)$ with high probability.

Proof. The proof is by the probabilistic method. We use the minimax theorem and the
following hard distribution: Put 1/2 weight on 1-inputs and 1/2 weight on 0-inputs to $f_{n,d}$.
Note that the mutual information of this distribution is $\Omega(n)$: for 1-inputs, given $x$ there are
at most $d$ inputs $y$ out of $2^n$ such that $x,y$ is a 1-input. Hence the information is at least
$(n - \log d)/2$.

We employ a 1-sided version of the discrepancy method (this is a relaxation of the 1-sided
rectangle corruption method). The 1-sided discrepancy under a distribution $\mu$ is $\mathrm{disc}'(f,\mu) =
\max_R \mu(f^{-1}(1) \cap R) - \mu(f^{-1}(0) \cap R)$, where the maximum is over all rectangles. Then
$R(f) \ge -\log \mathrm{disc}'(f,\mu) - O(1)$ for all $\mu$ that put weight 1/2 on the 1-inputs. Our goal is to
show that the 1-sided discrepancy is small with high probability over the choice of $f_{n,d}$.
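
For tiny matrices the 1-sided discrepancy can be evaluated by enumerating all rectangles. The sketch below assumes the hard distribution spreads weight 1/2 uniformly over the 1-inputs and weight 1/2 uniformly over the 0-inputs (the text fixes only the total weights), and uses hypothetical helper names; its running time is exponential in the matrix dimensions.

```python
from fractions import Fraction
from itertools import combinations

def nonempty_subsets(k):
    """All non-empty subsets of {0, ..., k-1}, as tuples."""
    for r in range(1, k + 1):
        yield from combinations(range(k), r)

def one_sided_discrepancy(f):
    """max over rectangles A x B of mu(1-inputs in R) - mu(0-inputs in R),
    with mu uniform on the 1-inputs (total 1/2) and on the 0-inputs (total 1/2).
    Requires f to have at least one 1-input and one 0-input."""
    rows, cols = len(f), len(f[0])
    ones = sum(map(sum, f))
    zeros = rows * cols - ones
    w1, w0 = Fraction(1, 2 * ones), Fraction(1, 2 * zeros)
    best = Fraction(0)
    for A in nonempty_subsets(rows):
        for B in nonempty_subsets(cols):
            n1 = sum(f[x][y] for x in A for y in B)
            n0 = len(A) * len(B) - n1
            best = max(best, n1 * w1 - n0 * w0)
    return best
```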

Fix a rectangle $R$ and consider a random $f_{n,d}$. We would like to compute the probability
that $\mathrm{disc}'(R) = \mu(f_{n,d}^{-1}(1) \cap R) - \mu(f_{n,d}^{-1}(0) \cap R)$ is large. Note that this is a random variable
and that $\mu$ depends on $f_{n,d}$.

If /i(i?n/“^(l)) < 4/d^/'^, then disc'{R) < 4/d^/'^ and we are done. Hence we assume the 
opposite. For R to contain at least a 4/d^/^ fraction of all 1-inputs it must be the case that 
R contains at least (4/d^/^) • 2”d/2 1-inputs, and no row or column contains more than 2d of 
them, which implies that R must have at least 2”/d^/‘^ rows and columns. 

Write R = A X B, where |A|, \B\ > 2P/d/!^. The expected number of 1-inputs in R is at 
most |A|-|B|-(i/2”. The 1-inputs are chosen independently, and the Chernoff bound yields that 
Prob{R contains more than {l + d~^^'^)\A\\B\d/2'^ 1-inputs) < e.-\A\B\d/(‘i- 2 '^d}/'^) <

Similarly, we can bound Prob{R contains less than (1 — d~^^'^)\A\\B\d/2'^ 1-inputs). 

Furthermore, since there are at most 2^"^^ rectangles, by the union bound with high 
probability these estimates are correct for all rectangles with enough rows and columns (in 
particular the rectangle consisting of all inputs). 

Note that R contains at least |^| • \B\ — |^|2d 0-inputs, each of which have weight at 
least 1/(2^”+^), for a total 0-weight of at least |^||il|/2^"'+^ — (i/2”. The weight of a single 

1- input is at most 1/(1 — (i“^/^) • l/(d2”+^) and the total 1-weight of R is at most (1 -|- 

(i“^/^)/(l — ■ |A| |ii|/2^”'’'^ by the above. Hence the one-sided discrepancy is at most 

0(d-l/2|^||5|/22ri+l) < 0((i-V2), g 

We will now show that most functions $f_{n,d}$ are easy under all low information distributions,
but hard for information $n$ distributions, by showing that $f_{n,d}$ has a certain property with
high probability. We assume in the following that $d \le 2^{\epsilon^2 n}$ and set $\epsilon = 1/10$.

Definition 32. We say a Boolean $2^n \times 2^n$ matrix is good, if it is true that every rectangle
$A \times B$ with $\min\{|A|, |B|\} \le 2^{2n/3}$ has no more than $100\max\{|A|, |B|\}$ 1-entries. We also call
any rectangle $A \times B$ with $\min\{|A|, |B|\} \le 2^{2n/3}$ in a good matrix good.

Lemma 33. With high probability the communication matrix of $f_{n,d}$ is good.

Proof. Fix A,B. Assume that \B\ > |A| and that |A| < 2^”/^. The probability that a fixed 
x,y is a 1-input is d/2'^. The probability that there are at least 100|i?| 1-inputs in R is at 

(IS) ■ 

There are (|A|)(|B|) — (e2”/|H|)^l^l rectangles of this size. By the union bound the 
probability that there is a rectangle that is not good is small. ■ 

Now assume that $f$ (or rather its matrix) is good. Consider any $\nu$ such that $I_\nu(X:Y) \le
\epsilon^3 n$. We have to give a protocol for $f$ under $\nu$. By Fact 0] there is another distribution $\mu$,
that is $\epsilon/2$-close to $\nu$ and has $I_\infty(X:Y) \le 8\epsilon^2 n$. We describe a protocol for $f$ under $\mu$ with
error $\epsilon/2$. The same protocol has error at most $\epsilon$ under $\nu$. We assume $d \le 2^{\epsilon^2 n}$.

Alice and Bob consider the marginal distributions $\mu_A$ and $\mu_B$. Alice sends 0, if $\mu_A(x) \le
2^{-n/2-\epsilon n}$, and 1 otherwise, while Bob does the same for $\mu_B(y)$. We first consider the rectangle
$R_{00}$ of inputs on which the messages were 00. Then $\mu_A(x)\cdot\mu_B(y) \le 2^{-n-2\epsilon n}$ for all $x,y$ in
$R_{00}$. Hence on this rectangle $\sum_{x,y \in R_{00},\, f(x,y)=1} \mu_A(x)\mu_B(y) \le 2d\cdot 2^{-2\epsilon n}$. That means that under
$\mu_A \times \mu_B$ the probability of 1-inputs in $R_{00}$ is at most $2d\cdot 2^{-2\epsilon n}$. But since $I_\infty(X:Y) \le 8\epsilon^2 n$,
the probability of 1-inputs there under $\mu$ is at most $2d\cdot 2^{-2\epsilon n + 8\epsilon^2 n}$. We can reject on $R_{00}$
without introducing much error.

Now consider one of the remaining rectangles, say $R_{10} = A \times B$ (the rectangle where
Alice sent 1 and Bob 0). Clearly $|A| \le 2^{n/2+\epsilon n}$. Assume $|A| \le |B|$. By the above lemma
this means that $A \times B$ is good, i.e., contains relatively few 1-inputs, on average only 100 per
column.

On $R_{10}$ Alice and Bob can send public coin fingerprints of $x, y$ each, with error guarantee
$\epsilon/1000$ (see Fact 10). This takes communication $O(-\log\epsilon)$. If a column (or row) contains
few 1-inputs Alice resp. Bob can test with the fingerprint whether $x, y$ is one of these. But
$R_{10}$ contains few 1-inputs only on average, and it is quite possible that both the row
and the column of $x, y$ have many 1-inputs. Namely, while the (uniformly) average column
has at most 100 1-inputs, the distribution on $R_{10}$ is not uniform, but instead rows with more
1-inputs have increased probability, so we need to be more careful.

Let $A = A_0$ and $B = B_0$ (we still assume that $|A| \le |B|$). Define $A_i$ as the set of $x \in A_{i-1}$
such that there are at least 1000 1-inputs $x, y'$ with $y' \in B_{i-1}$ and $B_i$ the set of $y \in B_{i-1}$
such that there are at least 1000 1-inputs $x', y$ with $x' \in A_{i-1}$.

Clearly, all $A_i \times B_i$ are good. Assume that $|A_i| \le |B_i|$. $A_i \times B_i$ has at most $100|B_i|$ 1-inputs.
$A_i \times B_{i+1}$ has at least $1000|B_{i+1}|$ 1-inputs, hence $|B_{i+1}| \le |A_i|/10$, because $A_i \times B_{i+1}$ is good:
$1000|B_{i+1}| \le 100\max\{|A_i|, |B_{i+1}|\}$. That means that for odd $i$ we have $|B_i| \le |A_{i-1}|/10$ and
for even $i$ we have $|A_i| \le |B_{i-1}|/10$.

All sets $A_i$, $B_i$ are known to Alice and Bob without communication. Also, due to the
shrinking sizes, all $i \le O(n)$.
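
The nested sets can be computed by a simple iteration. In this sketch the threshold is a parameter (the text uses 1000), the matrix is a list of 0/1 rows, and `shrink_sets` is a hypothetical helper name. The iteration stops at the first fixed point; for good rectangles the text argues that the sets keep shrinking, while for an arbitrary matrix the fixed point may be non-empty.

```python
def shrink_sets(matrix, A, B, threshold):
    """Return the list [(A_0, B_0), (A_1, B_1), ...] where
    A_i = rows of A_{i-1} with at least `threshold` ones inside B_{i-1}, and
    B_i = columns of B_{i-1} with at least `threshold` ones inside A_{i-1}."""
    levels = [(set(A), set(B))]
    while True:
        A_prev, B_prev = levels[-1]
        A_next = {x for x in A_prev
                  if sum(matrix[x][y] for y in B_prev) >= threshold}
        B_next = {y for y in B_prev
                  if sum(matrix[x][y] for x in A_prev) >= threshold}
        if (A_next, B_next) == (A_prev, B_prev):
            return levels          # fixed point reached
        levels.append((A_next, B_next))

# Toy example with threshold 2 instead of 1000.
toy = [[1, 1, 0],
       [1, 0, 0],
       [0, 0, 1]]
toy_levels = shrink_sets(toy, range(3), range(3), threshold=2)
```

Both players can run this computation locally, since it depends only on the function and the rectangle, not on the inputs.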

The protocol works as follows: Alice determines the first $i$ such that on $A_i \times B_{i-1}$ her
row contains at most 1000 1-inputs and sends this information. Bob also sends the index $j$,
such that on $A_{j-1} \times B_j$ his column contains at most 1000 1-inputs. If $i \le j$, then Bob also
sends a fingerprint of $y$ with error guarantee 1/10000 (see Fact 10). If there is a $y' \in B_{j-1}$
with the same fingerprint and $f(x,y') = 1$ then Alice accepts, otherwise she rejects. If $i > j$,
then Alice sends the fingerprint, and Bob accepts if and only if there is an $x' \in A_{j-1}$ with
the same fingerprint and $f(x',y) = 1$. Clearly the communication is $2\log n + O(1)$, and is done in 2 rounds.

Correctness: Assume $i \le j$. The players can be sure that $x,y \in A_i \times B_{j-1}$. There are
at most 1000 1-inputs in row $x$ in $B_{j-1}$. If $f(x,y) = 1$, then certainly the fingerprints will
coincide, and Alice accepts. Otherwise the probability that the fingerprints are equal is at most
1000/10000 = 1/10.

Lemma 34. Under any $\nu$ with information at most $\epsilon^3 n$ and for $6n \le d \le 2^{\epsilon^2 n}$ we have that
$D^\nu_\epsilon(f_{n,d}) \le O(\log n)$, if $f_{n,d}$ is good.

Theorem 35. For every $6n \le d \le 2^{n/100}$ there is a function $f_{n,d}$ such that

• $R(f_{n,d}) = \Theta(\log d)$,

• $R^{I\le n/1000}_{1/10}(f_{n,d}) \le O(\log n)$.


6 One-Round Error Dependence 

We now consider the general question of error dependence under distributions with limited
information. In the case where the information is bounded only by $n$, we get the standard
randomised (resp. quantum) communication complexity, for which the usual boosting techniques
(i.e., the Chernoff bound) show that the error dependence is at most a factor of $O(\log(1/\epsilon))$.
Furthermore, Corollary [22] shows that for all information parameters 1 < k < the 

error dependence is polynomial. This leaves the case of product distributions, where in the 
randomised two-way communication case DISJ has logarithmic error dependence. In this 
section we show that for all total functions, in the case of one-way communication complex¬ 
ity the error dependence is small under product distributions. The corresponding statement 
about two-way protocols remains open. 

In [TU] Kremer et al. show that the complexity of one-way protocols for total functions 
under product distributions is determined by the VC-dimension (see also HZ!)- 
Definition 36. The VC-dimension of a Boolean matrix $M$ is the largest $k$ such that there
is a $2^k \times k$ rectangle $R$ in $M$ such that $R$ contains all Boolean strings of length $k$ as rows.
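
On small matrices the definition can be checked by brute force: a set of $k$ columns witnesses VC-dimension at least $k$ exactly when the rows, restricted to those columns, realise all $2^k$ patterns. The following sketch uses a hypothetical helper name `vc_dimension` and runs in time exponential in the number of columns.

```python
from itertools import combinations

def vc_dimension(matrix):
    """Largest k such that some k columns are shattered by the rows of the 0/1 matrix."""
    cols = len(matrix[0])
    best = 0
    for k in range(1, cols + 1):
        shattered = any(
            len({tuple(row[c] for c in S) for row in matrix}) == 2 ** k
            for S in combinations(range(cols), k)
        )
        if not shattered:
            break   # subsets of shattered sets are shattered, so larger k fails too
        best = k
    return best
```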

The VC-dimension in turn characterises the number of examples needed to PAC-learn
the concept class given by the rows of the communication matrix of $f$, under any distribution
on the columns. Usually in learning theory a concept class is a set of Boolean functions
(‘concepts’), and here we view rows of the communication matrix of $f$ as functions $f_x(y) =
f(x,y)$. The task of PAC learning is for the learner to be able to compute $f_x(y)$ for most $y$
under a distribution $\mu$, after having seen labelled examples from the same distribution. It is
well known that $O(VC(f) \cdot 1/\epsilon \cdot \log(1/\epsilon))$ examples suffice [17].

Kremer et al. m proved the following upper bound on one-way communication complex¬ 
ity: R^^^’^~^{f) < 0{VC{f) ■ 1/e • log(l/e)). The idea is that Alice and Bob can choose 
examples y' from the public coin, which Alice can label by sending f{x,y'). Bob simulates 
the PAC learning algorithm for the rows of the communication matrix, and hence he can suc¬ 
cessfully predict f{x,y) for most y, including (likely) his own input. Note that there is also 
a lower bound of > (1 — H{e))VC{f) (which is even true in the entanglement 

assisted case with an additional factor of l /2i[T9l [3l [T8]. 

While it is known, that the number of examples needed to PAC-learn is at least /e) 

m, we get an exponentially better dependence on the error here for the one-way communi¬ 
cation model under product distributions. 

Our result has an appealing interpretation. Both the one-way model under product 
distributions and the PAC model can be viewed as learning models (for this it is crucial 
that the distributional one-way model is considered under product distributions). In the 
PAC model Alice (or nature) labels random examples drawn from a distribution, and Bob
has to end up being able to label new examples mostly correctly (under the same, unknown
distribution). In the one-way model, there is a known distribution on examples (columns),
and a known distribution on concepts (rows). The one-way model under product distributions 
can clearly simulate any PAC algorithm. But Alice can send any information she deems useful, 
not just label examples. Nevertheless, in both models the complexity is determined by the 
VC-dimension. Is a teacher like Alice not more useful than random labelled examples? We 
show that the one-way model (i.e., a teacher) is better in the sense that making the error 
small is exponentially cheaper there, compared to the PAC model. 

Theorem 37. For all total f: ■ log(l/e)) 

Proof. First, = &{VC{f)). Hence we need to show only that < 

0(UC(/)-log(I/e)). 

For a given distribution $\mu$ on the columns, an $\epsilon$-net among the rows of the communication
matrix is a subset $N$ of the set of rows, such that for every row $x$ there is a row $x' \in N$ which
coincides with $x$ with probability $1 - \epsilon$ under $\mu$. We have the following simple observation,
due to the fact that Alice can simply send the name of the closest $x' \in N$ to Bob.

Lemma 38. is upper bounded by the logarithm of the size of the smallest e-net 

for f and hb- 
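
A simple way to obtain some $\epsilon$-net (not necessarily the smallest one) is a greedy scan over the rows. The sketch below represents the column distribution as a list of probabilities and uses hypothetical helper names (`disagreement`, `greedy_eps_net`); in the one-way protocol Alice would send the index, within the net, of a row that is $\epsilon$-close to her own.

```python
def disagreement(matrix, mu, x, z):
    """Probability under mu (a list of column weights) that rows x and z differ."""
    return sum(p for y, p in enumerate(mu) if matrix[x][y] != matrix[z][y])

def greedy_eps_net(matrix, mu, eps):
    """Scan the rows; keep a row only if it is more than eps away from every kept row.
    Every row then has a kept row within distance eps, so the result is an eps-net."""
    net = []
    for x in range(len(matrix)):
        if all(disagreement(matrix, mu, x, z) > eps for z in net):
            net.append(x)
    return net
```

The number of bits Alice needs is then the logarithm of the size of the net, rounded up.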

Hence instead of the simulation Alice and Bob can agree on an $\epsilon$-net beforehand, and the
size of the $\epsilon$-net determines the complexity of the protocol. Note that PAC-learners also try
to find an $\epsilon$-net, but they are restricted to finding one from random examples. The size of
the constructed $\epsilon$-net is much smaller than the number of examples (this is not surprising,
since otherwise the concept is not learned yet). Indeed, Sauer’s lemma tells us enough about
the size of the $\epsilon$-net, when the specified number of examples have been chosen.

Fact 39. Let $M$ be a Boolean matrix with $r$ rows and $c$ columns and VC-dimension $d$. Then
$r \le \Phi(c,d)$, where $\Phi(c,d) = \sum_{i=0,\dots,d}\binom{c}{i} \le (ec/d)^d$.
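
The quantity $\Phi(c,d)$ and its standard upper bound are easy to evaluate. This is a small illustrative sketch; `sauer_phi` is a hypothetical name, and the bound $(ec/d)^d$ is compared only for $1 \le d \le c$.

```python
from math import comb, e

def sauer_phi(c, d):
    """Phi(c, d) = sum_{i=0..d} binom(c, i): the maximum number of distinct rows
    of a Boolean matrix with c columns and VC-dimension d."""
    return sum(comb(c, i) for i in range(d + 1))

def sauer_upper_bound(c, d):
    """The bound (e*c/d)^d, valid for 1 <= d <= c."""
    return (e * c / d) ** d
```

For example, with $c = 2$ columns and VC-dimension 2 the bound is $\Phi(2,2) = 4$, which is attained by the matrix whose rows are all four 2-bit strings.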

We now state the fundamental result from PAC learning (see Theorem 3.3 in [2]).

Fact 40. Consider any function $f : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}$. Assume there is a distribution $\mu$
on $y$'s that does not depend on $x$, and $x$ is also fixed but unknown. The PAC learner is given
$c = O(VC(f) \cdot 1/\epsilon \cdot \log(1/\epsilon))$ random examples $y_1,\dots,y_c$ from the distribution together with
labels $\ell_1 = f(x,y_1),\dots,\ell_c = f(x,y_c)$. If the learner chooses any $x'$ that is consistent with
these values (i.e., $f(x',y_i) = \ell_i$ for all $i = 1,\dots,c$), then the probability that $f(x',y) \ne f(x,y)$
is at most $\epsilon$ under $\mu$. Hence, if we choose a string $x'$ consistent with any vector $\ell_1,\dots,\ell_c$,
then we get an $\epsilon$-net for $f$, $\mu$.

The size of this ε-net is clearly at most $2^c$. Sauer's lemma can now be used to show that
the constructed ε-net can be made much smaller. The size of the ε-net constructed in Fact
40, without repetitions, is at most the size of the set of distinct rows in the matrix for f,
when we restrict the matrix to the c chosen columns (we may choose one x' for every distinct
value of the c labels appearing and add it into the ε-net).

The number of distinct rows is now bounded by Sauer's lemma as follows:
$\Phi(c, VC(f)) \le (e \cdot c / VC(f))^{VC(f)} = O((1/\varepsilon)\log(1/\varepsilon))^{VC(f)} \le (1/\varepsilon)^{O(VC(f))}$.
Hence the communication is at most the logarithm of this size, which yields the theorem. ■
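
To get a feel for the numbers, the following sketch evaluates Sauer's bound and the resulting message length numerically. It is an illustration only: the function names and the parameter `const`, which stands in for the unspecified constant hidden in the O(·) of the PAC sample bound, are assumptions introduced here and do not come from the proof.

```python
from math import ceil, comb, e, log2


def sauer_phi(c: int, d: int) -> int:
    """Phi(c, d) = sum_{i=0}^{d} C(c, i), Sauer's bound on the number of distinct rows."""
    return sum(comb(c, i) for i in range(d + 1))


def net_message_bits(vc_dim: int, eps: float, const: float = 1.0) -> float:
    """log2 of Sauer's bound when c = const * VC/eps * log2(1/eps) columns are used.

    `const` is a hypothetical stand-in for the constant in the O(.) sample bound.
    """
    c = ceil(const * vc_dim / eps * log2(1 / eps))
    return log2(sauer_phi(c, vc_dim))


if __name__ == "__main__":
    for eps in (0.1, 0.01, 0.001):
        print(eps, round(net_message_bits(3, eps), 2))
```

Under these assumptions the printed message length grows roughly linearly in log(1/ε) for fixed VC-dimension, while the number of sampled columns c grows like 1/ε.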

7 Open Problems 

• Can the error dependence of a tight upper bound on Ql^^{DISJ) be improved to 
log(l/e)? 

• Can the error dependence of be improved to log(l/e) for every total function 

/? 

• What is the trade-off between the number of rounds and the randomised complexity of 
DISJ under product distributions? 

• What is the quantum communication complexity of DISJ where the inputs are sets of 

size ^/n from a size n universe? The best known lower bound is the best known 

upper bound is 0(n^/^ logn). 

• What is the largest gap between and In the one-way model there 

is at most a constant gap for any total function. We have shown a quadratic gap for 
DISJ. 
Appendix

8 Randomised Protocol for DISJ under Product Distributions 

Proof of Theorem 17. Fix any product distribution μ on $\{0,1\}^n \times \{0,1\}^n$. The main idea
is (just like in [4]) to have a first phase in which large sets are reduced in size until both
sets have size O(√n). In phase 2 we employ the randomised protocol for DISJ on small sets
given by Håstad and Wigderson [11] (instead of communicating the sets). To simplify our
presentation we describe a randomised protocol.

Set S = √n. In phase 1 Alice and Bob try to shrink the universe U (without removing
positions in x ∩ y) until the size of U is at most S. At that point also |x ∩ U| and |y ∩ U|
are at most S and the players move to phase 2. The protocol starts with the universe
$U_0 = \{1, \dots, n\}$. The players maintain a current universe $U_i$ until $U_i$ is small at some point.

The protocol proceeds in rounds during phase 1 (we later explain how to get rid of all
but two rounds). In each round Alice and Bob exchange a bit each, indicating whether
|x|, |y| > S or not. If both are smaller, they move to phase 2. The players also maintain a
current rectangle of inputs $R_i = A_i \times B_i$ (this would be immediate in a deterministic protocol,
but needs to be maintained in the randomised case).
After this exchange, Alice and Bob each compute Prob(x, y are disjoint) on the current
distribution restricted to $R_i$ and their row/column. If this probability is less than ε for
someone, they reject and quit the protocol. Otherwise, one player who still has a large set,
say Alice, uses the public coin to generate samples $y' \in B_i$. These are disjoint from x with
probability at least ε. Hence, Alice can name a disjoint y' with expected communication
O(log(1/ε)). Since x ∩ y is disjoint from y', they set $U_{i+1} = U_i - y'$. The size of the universe
decreases by at least √n in each round in phase 1, the expected communication is O(log(1/ε))
per round, and there are at most √n rounds.
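
The step in which Alice names a disjoint sample can be sketched as follows. This is a simplified illustration, not the protocol itself: the function name `name_disjoint_sample` and the representation of the public-coin samples as a shared Python list are assumptions made here, and the distribution from which the samples are drawn is left to the caller.

```python
def name_disjoint_sample(x: frozenset, shared_samples: list):
    """Return (index, bits) for the first shared sample that is disjoint from x.

    `shared_samples` models the public-coin samples y' that both players can see.
    Alice only sends the 1-based index of the first disjoint sample; `bits` is the
    length of that index written in binary. Returns (None, 0) if no sample works.
    """
    for index, y_prime in enumerate(shared_samples, start=1):
        if x.isdisjoint(y_prime):
            return index, index.bit_length()
    return None, 0


if __name__ == "__main__":
    alice_set = frozenset({1, 2})
    public_samples = [frozenset({2, 3}), frozenset({1}), frozenset({4, 5})]
    print(name_disjoint_sample(alice_set, public_samples))
```

If each sample is disjoint from x with probability at least ε, the index of the first success is geometrically distributed, so its binary length is O(log(1/ε)) in expectation.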

Phase 2, as mentioned, is the protocol from [11], which solves DISJ with communication
O(√n log(1/ε)) and worst case error ε on sets of size at most √n.

Hence the total expected communication is at most O(√n log(1/ε)). We need a protocol
with a worst case communication bound, though. Note that during each round in phase 1,
using the public coin to pick a new y' corresponds to a Bernoulli trial with success probability
at least ε. The communication cost is the logarithm of the number of the first successful trial.
The probability that this is larger than t log(1/ε) is at most $e^{-(1/\varepsilon)^{t-1}}$. Assume there are
T rounds in phase 1. The probability that the message length in any round is more than
(T + 1) log(1/ε) is at most $T \cdot e^{-(1/\varepsilon)^{T}} < \varepsilon$. Hence we can assume that the message length is at
most (T + 1) log(1/ε) in all rounds (the probability that this is not the case is bounded by ε).
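
The single-round tail bound can be checked with a short calculation. The sketch below assumes that the index K of the first successful trial is geometric with success probability p ≥ ε, that the message has length log K, and that $(1/\varepsilon)^t$ is treated as an integer; the names K and p are introduced here for illustration.

```latex
\Pr\bigl[\log K > t\log(1/\varepsilon)\bigr]
  = \Pr\bigl[K > (1/\varepsilon)^{t}\bigr]
  = (1-p)^{(1/\varepsilon)^{t}}
  \le (1-\varepsilon)^{(1/\varepsilon)^{t}}
  \le \exp\bigl(-\varepsilon\,(1/\varepsilon)^{t}\bigr)
  = \exp\bigl(-(1/\varepsilon)^{t-1}\bigr).
```

Taking t = T + 1 and a union bound over the T rounds gives a failure probability of at most $T\exp(-(1/\varepsilon)^{T})$.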

We now bound the probability that the total message length is more than 10T log(1/ε),
by appealing to the Hoeffding bound. Note that the message lengths of all rounds are (still)
independent, and that we just established an upper bound on the message length. The
Hoeffding bound now implies that the probability of the total message length being larger
than the stated bound is at most ε. Furthermore, we have that T ≤ √n with certainty. This
shows that the communication of phase 1 is at most O(√n log(1/ε)). Note that the protocol
needs to be modified such that it aborts if the communication in phase 1 exceeds this bound.
This introduces error at most ε. ■

9 Randomised Protocol and Distributions with Bounded Mutual Information

Proof of the upper bound theorem. Fix any distribution μ that has information at most k. The
protocol we describe again has 2 phases. Informally, the first phase shrinks the sets of Alice and
Bob (which could be arbitrarily large) until their sizes are both small enough. The second
phase is small set disjointness, as considered before by Håstad and Wigderson and more
recently by Sağlam and Tardos [23]. We will establish an upper bound of $O(\sqrt{n(k+1)}/\varepsilon)$
on the expected communication complexity with error ε. Then the theorem (which claims a
worst-case bound) follows via the Markov inequality: if the stated communication bound is
violated, stop the protocol and output a random bit.

Set $S = \sqrt{(k+1)n}$. The goal of the first phase is to make both sets smaller than S.
Suppose Alice holds x and Bob y. They communicate to determine whether one of them has a
set larger than S. This needs communication O(1). If both sets are small we move to phase 2
described below.

In phase 1 Alice and Bob try to shrink the universe U until the size of U is at most S.
At that point also |x ∩ U| and |y ∩ U| are at most S and the players move to phase 2. The
protocol starts with the universe $U_0 = \{1, \dots, n\}$. The players maintain a current universe
$U_i$ until $U_i$ is small at some point.

Note that while the information under $\mu_0 = \mu$ is at most k, in some branches of the
protocol the information on the current sub-rectangle can grow, and we need that on average
it is bounded by k. We keep a transcript $T_i = (A_i, B_i, C_i)$, where $C_i$ is the public coin
randomness, and $R_i = A_i \times B_i$ is the rectangle in the communication matrix induced by the
messages up to round i, for public coin value $C_i$. Note that the rectangles $R_i$ form a partition
of the communication matrix if we fix $C_i$ (since they result from a deterministic protocol
then). By Lemma 10 we have that $I(X : Y \mid T_i) \le I(X : Y)$.

Denote by $\mu_{t_i}$ the distribution on inputs conditional on the transcript being $T_i = t_i$. $\mu^{x}_{t_i}$
is $\mu_{t_i}$ restricted to the row X = x. $\mu^{1}_{t_i}$ and $\mu^{x,1}_{t_i}$ denote the distributions restricted to 1-inputs
of DISJ. $\mu_{t_i,Y}$ is the marginal of $\mu_{t_i}$ on Bob's inputs, $\mu^{x,1}_{t_i,Y}$ is the distribution on y's under
$T_i = t_i$, for fixed x and conditioned on x ∩ y = ∅. $\mu^{x}_{t_i,Y}$ is the distribution on y's under $T_i = t_i$,
for fixed x.

Here is the protocol for phase 1. Explanations follow. 

1. Alice and Bob check whether |x| ≤ S and |y| ≤ S on $U_i$. If both are, they move to
phase 2. W.l.o.g. assume that |y| > S, otherwise the following steps are done by Bob
in an analogous fashion.

2. Alice computes the probability that DISJ(x, y') = 1 if y' is chosen from $\mu^{x}_{t_i,Y}$. If this
probability is less than ε/2, she ends the protocol with output 0.

3. Alice computes $\mu^{x,1}_{t_i,Y}$. Another distribution, this one known to both players, is $\mu_{t_i,Y}$.

4. Alice and Bob use rejection sampling as in Fact 8 (using the distributions $\mu^{x,1}_{t_i,Y}$ and
$\mu_{t_i,Y}$) to discover a $y'_i$ distributed according to $\mu^{x,1}_{t_i,Y}$.

5. Alice and Bob set $U_{i+1} = U_i - y'_i$.

6. $t_{i+1}$ is $t_i$ together with the message and randomness from 1. $\mu_{t_{i+1}}$ is μ conditioned on
$T_{i+1} = t_{i+1}$.

7. Move to step 1.

We note the following on the different steps. 

1. Communication is O(1).

2. Clearly the total error introduced by these steps under μ can never be more than ε/2.
If the protocol moves ahead the probability of DISJ(x, y) = 1 is at least ε/2 under $\mu^{x}_{t_i,Y}$.

3. Since $I(X : Y \mid T_i) \le k$ we have that $E[D(\mu^{x}_{t_i,Y} \| \mu_{t_i,Y})] \le k$.

4. $D(\mu^{x,1}_{t_i,Y} \| \mu_{t_i,Y}) \le 2(D(\mu^{x}_{t_i,Y} \| \mu_{t_i,Y}) + 1)/\varepsilon - \log(\varepsilon/2)$ due to Lemma 7 and hence the
rejection sampling protocol from Fact 8 uses expected communication O((k + 1)/ε).
Drawn $y'_i$ are always disjoint from x.

5. $|y'_i| > S$. Hence $|U_i - U_{i+1}| > S$. This step can be performed at most $n/\sqrt{n(k+1)}$
times.
The protocol ends phase 1 with sets $x \cap U_j$ held by Alice and $y \cap U_j$ held by Bob, and
$|x \cap U_j|, |y \cap U_j| \le S$, and $DISJ(x,y) = 1 \iff DISJ(x \cap U_j, y \cap U_j) = 1$. The probability
that the protocol ends during phase 1 and makes an error is at most ε/2. The expected
communication is at most $O(\sqrt{n(k+1)}/\varepsilon)$.
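
The expected cost of phase 1 follows from multiplying the number of rounds by the cost per round. A short worked version, assuming that each round costs O(1) bits for the size check plus O((k + 1)/ε) expected bits for the rejection sampling step, and that there are at most n/S rounds with $S = \sqrt{(k+1)n}$:

```latex
\underbrace{\frac{n}{S}}_{\text{rounds}} \cdot
\underbrace{O\!\left(\frac{k+1}{\varepsilon}\right)}_{\text{per round}}
  = \frac{n}{\sqrt{(k+1)n}} \cdot O\!\left(\frac{k+1}{\varepsilon}\right)
  = O\!\left(\frac{\sqrt{n(k+1)}}{\varepsilon}\right).
```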

Phase 2 is simply the Håstad-Wigderson protocol for small set disjointness that
finishes the protocol in communication $O(\sqrt{n(k+1)}\log(1/\varepsilon))$ and with worst case error ε/2.
Hence we get a protocol with error ε, and expected communication $O(\sqrt{n(k+1)}/\varepsilon)$.


10 Randomised Lower Bound for DISJ 


We first bound the information. Letting X and Y follow the marginal distributions of $\mu_{n,k-1}$,
respectively, we have:

$$I(X : Y) = H(X) - H(X \mid Y) = \log\binom{n}{m} - \log\binom{n-m}{m} = \log\frac{n(n-1)\cdots(n-m+1)}{(n-m)\cdots(n-2m+1)}$$

$$\le \log\left(1 + \frac{m}{n-2m+1}\right)^{m} \le (\log e)\,\frac{m^{2}}{n-2m+1} = c^{2}(\log e)(1+o(1))(k+1) \le k$$

for any sufficiently large n.


Proof of the lower bound theorem. We may assume that k = o(n), since otherwise (if k = Ω(n)),
the original proof by Razborov [22] applies directly. Let l ∈ N be given and assume that n = 4l − 1.
Let $\gamma = \log_l(c\sqrt{n(k+1)})$, where $c = (\log e)^{-1}$. Thus γ ∈ (1/2, 1) (for n sufficiently large)
and our distribution will pick sets of size $P = l^{\gamma} = c\sqrt{n(k+1)}$. Throughout the proof we will
treat numbers like P as natural numbers, and avoid using the floor function for the sake of
readability. We will also identify $\mathcal{P}(\{1,\dots,n\})$ with $\{0,1\}^n$.

We now give an alternative definition for the distribution $\mu = \mu_{n,k}$, as the distribution
induced by the following process: First, a triple $T = (T_1, T_2, \{i\})$ is chosen uniformly among all
such triples, where $|T_1| = |T_2| = 2l - 1$ and $\{T_1, T_2, \{i\}\}$ form a partition of the set {1, ..., n}.
Then, with probability 1/2 the set x is chosen uniformly among all subsets of $T_1 \cup \{i\}$ with
P elements and such that they contain i, and with probability 1/2 the set x is chosen as a
subset of $T_1$ with P elements, again uniformly among all such subsets of $T_1$. Similarly, and
independently of the choice of x, y is chosen with probability 1/2 uniformly as a subset of
$T_2 \cup \{i\}$ with P elements and such that it contains i, and with probability 1/2 uniformly among
the subsets of $T_2$ with P elements (not containing i). Thus non-zero probabilities are assigned
only on the set $\{(x,y) \mid x, y \subseteq \{1,\dots,n\},\ |x| = |y| = P,\ |x \cap y| \in \{0,1\}\}$.
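
The sampling process just described can be sketched in code. The following is a minimal illustration under stated assumptions: the function name `sample_pair` and its parameters are hypothetical, the set size P is passed in directly instead of being derived from n and k, and the uniform triple is obtained by shuffling the universe.

```python
import random


def sample_pair(l: int, P: int, rng: random.Random):
    """Sample (x, y, i) following the partition-based process, for n = 4l - 1.

    A random permutation yields a uniform triple (T1, T2, {i}) with |T1| = |T2| = 2l - 1.
    Each player's set contains i with probability 1/2, independently, and is otherwise
    uniform among the P-element subsets of its own block (plus i if i is included).
    """
    n = 4 * l - 1
    universe = list(range(1, n + 1))
    rng.shuffle(universe)
    i, t1, t2 = universe[0], universe[1:2 * l], universe[2 * l:]

    def pick(block):
        if rng.random() < 0.5:
            return frozenset(rng.sample(block, P - 1)) | {i}
        return frozenset(rng.sample(block, P))

    return pick(t1), pick(t2), i


if __name__ == "__main__":
    x, y, i = sample_pair(l=5, P=3, rng=random.Random(0))
    print(sorted(x), sorted(y), i, sorted(x & y))
```

Every sampled pair has |x| = |y| = P and intersects in at most the single element i, matching the support described above.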

Now the statement that the distributional complexity of DISJ under μ with error ε is
$\Omega(\sqrt{n(k+1)})$, for any sufficiently small constant ε > 0, follows directly from Lemma 41 below. ■
Lemma 41. Let γ and μ be defined as in the proof above. Let $A = \{(x,y) \mid \mu(x,y) > 0$
and $x \cap y = \emptyset\}$ and $B = \{(x,y) \mid \mu(x,y) > 0$ and $x \cap y \ne \emptyset\}$. For any sufficiently small
ε > 0 we have for any rectangle $R = C \times D \subseteq \mathcal{P}(\{1,\dots,n\})^2$ that

$$\mu(B \cap R) \ge \varepsilon\,\mu(A \cap R) - 2^{-\varepsilon n^{\gamma}}.$$

Proof. We consider ε > 0 to be fixed (but will specify its value later). We begin by defining
for any triple $T = (T_1, T_2, \{i\})$ as above, the numbers $Row(T) = \Pr[x \in C \mid x \subseteq T_1 \cup \{i\}]$,
$Row_0(T) = \Pr[x \in C \mid x \subseteq T_1 \cup \{i\}, i \notin x]$ and $Row_1(T) = \Pr[x \in C \mid x \subseteq T_1 \cup \{i\}, i \in x]$,
and similarly $Col(T) = \Pr[y \in D \mid y \subseteq T_2 \cup \{i\}]$, $Col_0(T) = \Pr[y \in D \mid y \subseteq T_2 \cup \{i\}, i \notin y]$
and $Col_1(T) = \Pr[y \in D \mid y \subseteq T_2 \cup \{i\}, i \in y]$. It is important to note that
$Row(T) = \frac{1}{2}(Row_0(T) + Row_1(T))$ and $Col(T) = \frac{1}{2}(Col_0(T) + Col_1(T))$, just as in the case of
Razborov's original distribution, and for the same reasons.

Next, for a triple $T = (T_1, T_2, \{i\})$ (and under the above distribution μ) we say that T is
x-bad if $Row_1(T) < \frac{1}{6}Row_0(T) - 2^{-\varepsilon n^{\gamma}}$ and that T is y-bad if $Col_1(T) < \frac{1}{6}Col_0(T) - 2^{-\varepsilon n^{\gamma}}$.
If T is x-bad or y-bad, we say that T is bad. Let $Bad_x(T)$, $Bad_y(T)$ and $Bad(T)$ be the
respective event indicators.

Claim 42. For all $t_2 \subseteq \{1,\dots,n\}$ with $|t_2| = 2l - 1$, we have that $\Pr[Bad_x(T) = 1 \mid T_2 = t_2] \le \frac{1}{5}$
and $\Pr[Bad_y(T) = 1 \mid T_2 = t_2] \le \frac{1}{5}$.

Proof of the Claim. We prove the first statement, the second one having an almost 
identical proof. 

Let $t_2 \subseteq \{1,\dots,n\}$ with $|t_2| = 2l - 1$ be fixed. Under our distribution, Row(T) can
take different values even when T is restricted to partitions for which $T_2 = t_2$. Thus we first
treat the case when $\max\{Row(T) \mid T_2 = t_2\} < 2^{-\varepsilon n^{\gamma}}$. If this inequality holds, then for all T
with $T_2 = t_2$ we have $Row(T) < 2^{-\varepsilon n^{\gamma}}$, and hence $Row_0(T) \le 2Row(T) < 2 \cdot 2^{-\varepsilon n^{\gamma}}$, so that
$\frac{1}{6}Row_0(T) - 2^{-\varepsilon n^{\gamma}} < 0 \le Row_1(T)$ holds trivially (and hence $\Pr[Bad_x(T) = 1 \mid T_2 = t_2] = 0$).

Next we treat the case where $\max\{Row(T) \mid T_2 = t_2\} \ge 2^{-\varepsilon n^{\gamma}}$. Define
$S = \{x \in C \mid |x| = P,\ x \subseteq \{1,\dots,n\} \setminus t_2\}$. Note that for any T with $T_2 = t_2$, Row(T) measures the
conditional probability (conditioned on T) of the same set S, with each x ∈ S having a
different (conditional) probability depending on whether i ∈ x. Specifically, if i ∈ x then the
probability of x being chosen, conditioned on T, is $\frac{1}{2\binom{2l-1}{P-1}}$, otherwise the probability is
$\frac{1}{2\binom{2l-1}{P}} = \frac{1}{2l^{1-\gamma}-1} \cdot \frac{1}{2\binom{2l-1}{P-1}}$. Thus, when T is fixed, the probability of
each set x containing i is $2l^{1-\gamma} - 1$ times that of a set which does not contain i.

The proof of this case will proceed as follows: First, we show that under the assumption
that a sufficiently large part of the partitions T with $T_2 = t_2$ are x-bad, three quarters of the
elements of S (which are subsets of $\{1,\dots,n\} \setminus t_2$) must have at least 21/25 of their elements
in a subset of $\{1,\dots,n\} \setminus t_2$ of size 8l/5. We will then upper-bound the number of subsets of
$\{1,\dots,n\} \setminus t_2$ of size P that have this property (regardless of whether they are in C or not).
Next, we will lower-bound $\frac{3}{4}|S|$ in terms of ε, and show that for a suitable choice of ε, the
lower bound for $\frac{3}{4}|S|$ is in fact larger than the upper bound we computed before, which is
a contradiction showing that it is not possible for T with $T_2 = t_2$ to be x-bad for that
many choices of i.

Note first that whenever $T_2$ is fixed (in our case to $t_2$), the choice of $i \in \{1,\dots,n\} \setminus t_2$
also fixes $T_1$ and hence all of T, and that the choice of i determines the proportion of x ∈ S
whose weights are counted in $Row_1(T)$. If for a particular choice of i the resulting T is
x-bad, then by definition we have that $Row_1(T) < \frac{1}{6}Row_0(T) - 2^{-\varepsilon n^{\gamma}}$, and in particular that
$Row_1(T) < \frac{1}{6}Row_0(T)$. If we let S' be the set of x ∈ S with i ∈ x, then we may rewrite this
inequality as:

$$\frac{|S'|}{\binom{2l-1}{P-1}} < \frac{|S| - |S'|}{6\binom{2l-1}{P}} \iff |S'| < \frac{|S| - |S'|}{6(2l^{1-\gamma} - 1)} \iff |S'|\left(1 + \frac{1}{6(2l^{1-\gamma} - 1)}\right) < \frac{|S|}{6(2l^{1-\gamma} - 1)},$$

and we may conclude that for l sufficiently large, $|S'| < \frac{|S|}{10\,l^{1-\gamma}}$ (under the assumption that T
is x-bad). For the last inequality we have used the fact that $\lim_{n\to\infty} l^{1-\gamma} = \infty$, which holds
because: $\lim_{n\to\infty} \log l^{1-\gamma} = \lim_{n\to\infty}(1-\gamma)\log l = \lim_{n\to\infty}(1 - \log_l(c\sqrt{n(k+1)}))\log l \ge
\lim_{n\to\infty}(\log l - \log\sqrt{n(k+1)}) \ge \lim_{n\to\infty}\log\left(\frac{1}{4}\sqrt{n/(k+1)}\right) = \infty$ (since k = o(n)).

Let $B = \{i \in \{1,\dots,n\} \setminus t_2 \mid$ the partition $(\{1,\dots,n\} \setminus (t_2 \cup \{i\}), t_2, \{i\})$ is x-bad$\}$, and
assume that $|B| \ge \frac{2l}{5}$, that is, assume that for at least one fifth of the possible choices for i
the corresponding partition is x-bad. By excluding some elements of B, we may assume that
$|B| = \frac{2l}{5}$. Now, if we consider the number of pairs (x, i) with x ∈ S and i ∈ x, we have
by the inequality in the last paragraph that each of the i ∈ B can be the second element
of at most $\frac{|S|}{10\,l^{1-\gamma}}$ such pairs, and hence B can contribute the second element of at most
$\frac{2l}{5} \cdot \frac{|S|}{10\,l^{1-\gamma}} = \frac{1}{25}P|S|$ of a total of $P|S|$ pairs (recall that $P = l^{\gamma}$). Applying the Colouring Lemma below with
X = S, Y = {1, ..., P}, c(x, j) = 0 if and only if the j-th smallest element of x is in B (so
that $p \ge \frac{24}{25}$) and $r = \frac{21}{25}$, we have that at least three quarters of all x ∈ S have the property
that more than $\frac{21}{25}$ of their elements lie in $G = \{1,\dots,n\} \setminus (t_2 \cup B)$. Let Q be the set of
subsets $x \subseteq B \cup G$ $(= \{1,\dots,n\} \setminus t_2)$ with |x| = P and the property that $|x \cap G| > \frac{21}{25}P$. Then
we must have that $|Q| \ge \frac{3}{4}|S|$. We will now upper-bound the size of the set Q.

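Since every x ∈ Q can have a proportion of at most 4/25 of its elements in B, we have
that

$$\log|Q| \le \log\sum_{j=0}^{4P/25}\left(\frac{2el}{5j}\right)^{j}\left(\frac{8el}{5(P-j)}\right)^{P-j}$$

$$\le \log\left[\left(\tfrac{4P}{25}+1\right)\left(\frac{2el}{5\cdot\frac{4}{25}P}\right)^{\frac{4}{25}P}\left(\frac{8el}{5\cdot\frac{21}{25}P}\right)^{\frac{21}{25}P}\right]$$

$$= \tfrac{4}{25}P\log\left(\tfrac{5e}{2}\,l^{1-\gamma}\right) + \tfrac{21}{25}P\log\left(\tfrac{40e}{21}\,l^{1-\gamma}\right) + O(\log l)$$

$$= (1-\gamma)P\log l + P\left(\tfrac{4}{25}\log\tfrac{5e}{2} + \tfrac{21}{25}\log\tfrac{40e}{21}\right) + O(\log l)$$

$$\le (1-\gamma)P\log l + 2.43508 \cdot P + O(\log l),$$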


where in the first line we used the inequality $\binom{n}{k} \le \left(\frac{en}{k}\right)^{k}$ for each term of the sum. The
inequality sign between the first and second line can be justified as follows: For x ∈ (0, 1/5),
consider the expression

$$\log\left[\left(\frac{2el}{5xP}\right)^{xP}\left(\frac{8el}{5(1-x)P}\right)^{(1-x)P}\right] = (1-\gamma)P\log l + P\left(x\log\frac{2e}{5x} + (1-x)\log\frac{8e}{5(1-x)}\right)$$

and set $f(x) = x\log\frac{2e}{5x} + (1-x)\log\frac{8e}{5(1-x)} = x\log\frac{1}{x} + (1-x)\log\frac{1}{1-x} + (1+\log e)x + (3+\log e)(1-x) - \log 5
= 3 + \log e - \log 5 + H(x) - 2x$. Then $f'(x) = H'(x) - 2 = \log\frac{1-x}{x} - 2$.
Note that f' is decreasing on (0, 1), and the only value $x_0 \in (0,1)$ for which $f'(x_0) = 0$ is
$x_0 = \frac{1}{5}$, so f' is positive on (0, 1/5). This implies that f(x), and
hence also the argument of the logarithm in the expression above, is strictly increasing on
(0, 1/5). Thus the terms of the sum

$$\sum_{j=0}^{4P/25}\left(\frac{2el}{5j}\right)^{j}\left(\frac{8el}{5(P-j)}\right)^{P-j}$$

are increasing, so that each term is upper-bounded by the final term, which justifies the
inequality between the first and second line above.
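
The numerical constants used in this part of the argument can be checked directly. The sketch below evaluates f in its simplified form 3 + log e − log 5 + H(x) − 2x (logarithms to base 2) at x = 4/25, compares it with log(2e), which is the limit of the constant log(2(e − o(1))) used in the matching lower bound, and tests monotonicity on a grid. The variable names and the grid resolution are choices made here for illustration.

```python
from math import e, log2


def binary_entropy(x: float) -> float:
    """Binary entropy H(x) in bits, for 0 < x < 1."""
    return -x * log2(x) - (1 - x) * log2(1 - x)


def f(x: float) -> float:
    """f(x) = 3 + log e - log 5 + H(x) - 2x, the simplified form from the text."""
    return 3 + log2(e) - log2(5) + binary_entropy(x) - 2 * x


upper_const = f(4 / 25)      # constant multiplying P in the upper bound on log|Q|
lower_const = log2(2 * e)    # limiting constant in the lower bound on log(3|S|/4)
grid = [j / 1000 for j in range(1, 200)]
is_increasing = all(f(a) < f(b) for a, b in zip(grid, grid[1:]))

if __name__ == "__main__":
    print(round(upper_const, 5), round(lower_const, 5), is_increasing)
```

The gap between the two constants is small but positive, which is what leaves room for the term depending on ε in the contradiction.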

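Next we compute a lower bound for $\frac{3}{4}|S|$. Let T* be a partition with $T^*_2 = t_2$ and
$Row(T^*) = \max\{Row(T) \mid T_2 = t_2\}$. Then we have that
$\frac{3}{4}|S| = \frac{3}{4}\left[Row_0(T^*)\binom{2l-1}{P} + Row_1(T^*)\binom{2l-1}{P-1}\right] \ge \frac{3}{4}(Row_0(T^*) + Row_1(T^*))\binom{2l-1}{P-1} = \frac{3}{2}Row(T^*)\binom{2l-1}{P-1}$.
Finally we have:

$$\log\tfrac{3}{4}|S| \ge \log\left[\tfrac{3}{2}\,2^{-\varepsilon n^{\gamma}}\binom{2l-1}{P-1}\right]$$

$$\ge P\log\left((e - o(1))\,\frac{2l-1}{P-1}\right) - \varepsilon n^{\gamma} - O(\log l)$$

$$\ge (1-\gamma)P\log l + P\log(2(e - o(1))) - \varepsilon\cdot(4l-1)^{\gamma} - O(\log l)$$

$$\ge (1-\gamma)P\log l + 2.4426 \cdot P - \varepsilon\cdot(4l)^{\gamma} - O(\log l) \quad \text{(for large } l\text{)}.$$

For sufficiently small ε (so that $\varepsilon \cdot 4^{\gamma} < 2.4426 - 2.43508$) and large l we get the desired
contradiction, that $\frac{3}{4}|S| > |Q|$.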

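(The lower bound for $\binom{2l-1}{P-1}$ above can be obtained using the Stirling bounds for the
factorial, $\sqrt{2\pi n}\left(\frac{n}{e}\right)^{n} \le n! \le e\sqrt{n}\left(\frac{n}{e}\right)^{n}$, as follows:

$$\binom{2l-1}{P-1} \ge \frac{\sqrt{2\pi(2l-1)}\,(2l-1)^{2l-1}}{e^{2}\sqrt{(P-1)(2l-P)}\,(P-1)^{P-1}(2l-P)^{2l-P}}$$

$$= \frac{\sqrt{2\pi(2l-1)}}{e^{2}\sqrt{(P-1)(2l-P)}}\left(\frac{2l-1}{P-1}\right)^{P-1}\left(\frac{2l-1}{2l-P}\right)^{2l-P}$$

$$= \frac{\sqrt{2\pi(2l-1)}}{e^{2}\sqrt{(P-1)(2l-P)}}\left(\frac{2l-1}{P-1}\right)^{P-1}\left[\left(1 + \frac{P-1}{2l-P}\right)^{\frac{2l-P}{P-1}}\right]^{P-1}$$

$$= \frac{\sqrt{2\pi(2l-1)}}{e^{2}\sqrt{(P-1)(2l-P)}}\left(\frac{2l-1}{P-1}\right)^{P-1}(e - o(1))^{P-1}.)$$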


Claim 43. $E[Row_0(T)Col_0(T)(1 - Bad(T))] \ge \frac{1}{5}E[Row_0(T)Col_0(T)]$.

Proof of the Claim. Since $Bad(T) \le Bad_x(T) + Bad_y(T)$, it is enough to prove that
$E[Row_0(T)Col_0(T)Bad_x(T)] \le \frac{2}{5}E[Row_0(T)Col_0(T)]$, with a similar statement for $Bad_y(T)$
being proved in the same fashion. For each $t_2 \subseteq \{1,\dots,n\}$ with $|t_2| = 2l - 1$, we will
show that the desired inequality holds when conditioned on $T_2 = t_2$, which implies that the
unconditioned inequality holds. All triples T with $T_2 = t_2$ have the same value for $Col_0(T)$,
so let this value be called d. Also let $r = E[Row(T) \mid T_2 = t_2]$. Now we have:

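$$E[Row_0(T)Col_0(T)Bad_x(T) \mid T_2 = t_2]$$

$$\le d \cdot E[Row_0(T)Bad_x(T) \mid T_2 = t_2]$$

$$\le d \cdot E\bigl[2 \cdot E[Row(T) \mid T_2 = t_2] \cdot Bad_x(T) \mid T_2 = t_2\bigr]$$

$$\le 2dr \cdot E[Bad_x(T) \mid T_2 = t_2]$$

$$\le \tfrac{2}{5}dr \quad \text{(by Claim 42)}$$

$$= \tfrac{2}{5}d \cdot E[Row(T) \mid T_2 = t_2]$$

$$= \tfrac{2}{5}d \cdot E[Row_0(T) \mid T_2 = t_2]$$

$$= \tfrac{2}{5}E[Row_0(T)Col_0(T) \mid T_2 = t_2]$$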

The inequality between the second and the third line can be justified as follows: Recall that,
as observed in the proof of Claim 42, even when considering only triples with $T_2 = t_2$, the value
of Row(T) can differ by a factor of at most $2l^{1-\gamma} - 1$. This is due to the fact that Row(T)
measures the probability (conditioned on T) of the same set $S = \{x \in C \mid |x| = P,\ x \subseteq
\{1,\dots,n\} \setminus t_2\}$, but depending on whether i ∈ x (for a particular choice of i and hence of
T), an x ∈ S will have (conditional) probability either $\frac{1}{2\binom{2l-1}{P-1}}$ or $\frac{1}{2\binom{2l-1}{P}}$. Thus if T*
is a triple with $T^*_2 = t_2$ for which $Row_0(T^*) = \max\{Row_0(T) \mid T_2 = t_2\}$, then Row(T*)
must be the minimum among all values of Row(T) when $T_2 = t_2$, because when T = T* the
largest portion of elements of S have probability $\frac{1}{2\binom{2l-1}{P}}$ instead of $\frac{1}{2\binom{2l-1}{P-1}}$. It follows
that for all T with $T_2 = t_2$ we have $Row(T) \ge Row(T^*) \ge \frac{1}{2}Row_0(T^*)$, and hence that
$E[2Row(T) \mid T_2 = t_2] \ge Row_0(T^*)$. On the other hand we have that for all T with $T_2 = t_2$,
$Row_0(T) \le Row_0(T^*)$, so finally we get that $Row_0(T) \le 2E[Row(T) \mid T_2 = t_2]$ for all T with
$T_2 = t_2$. ■

Claim 44. For any rectangle R: $\mu(B \cap R) = \frac{1}{4}E[Row_1(T)Col_1(T)]$ and $\mu(A \cap R) = \frac{3}{4}E[Row_0(T)Col_0(T)]$
(with the expectation taken over all partitions T).

The proof of this claim is identical to the case where μ is the distribution in Razborov's
proof (see [22]), since the relevant observations also apply to our modified distribution: 1.
μ(B) = 1/4 (and hence μ(A) = 3/4), because for every fixed partition T, i ∈ x with probability
1/2 and i ∈ y with probability 1/2, independently. 2. i ∈ x and i ∈ y are independent events
(for the same reason). 3. For every (x, y) with x ∩ y = ∅ we have that $\Pr[(x,y) \mid (i \notin
x) \wedge (i \notin y)] = \Pr[(x,y) \mid ((i \notin x) \wedge (i \notin y)) \vee ((i \in x) \wedge (i \notin y)) \vee ((i \notin x) \wedge (i \in y))]$,
because conditioning on either one of the two events induces the uniform distribution on the
set $\{(x,y) \mid x, y \subseteq \{1,\dots,n\},\ x \cap y = \emptyset,\ |x| = |y| = P\}$.
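We now use Claims 43 and 44 to prove the statement of the lemma:

$$\mu(B \cap R) = \tfrac{1}{4}E[Row_1(T)Col_1(T)]$$

$$\ge \tfrac{1}{4}E[Row_1(T)Col_1(T)(1 - Bad(T))]$$

$$\ge \tfrac{1}{4}E\left[\left(\frac{Row_0(T)}{6} - 2^{-\varepsilon n^{\gamma}}\right)\left(\frac{Col_0(T)}{6} - 2^{-\varepsilon n^{\gamma}}\right)(1 - Bad(T))\right] \quad \text{(by def. of Bad)}$$

$$= \tfrac{1}{4}E\left[\left(\frac{Row_0(T)Col_0(T)}{36} - \frac{2^{-\varepsilon n^{\gamma}}}{6}(Row_0(T) + Col_0(T)) + 2^{-2\varepsilon n^{\gamma}}\right)(1 - Bad(T))\right]$$

$$\ge \tfrac{1}{144}E[Row_0(T)Col_0(T)(1 - Bad(T))] - 2^{-\varepsilon n^{\gamma}} \quad \text{(since } Row_0(T) + Col_0(T) \le 2\text{)}$$

$$\ge \tfrac{1}{720}E[Row_0(T)Col_0(T)] - 2^{-\varepsilon n^{\gamma}} \quad \text{(by Claim 43)}$$

$$= \tfrac{1}{540}\,\mu(A \cap R) - 2^{-\varepsilon n^{\gamma}} \quad \text{(by Claim 44)}$$

Choosing ε to be smaller than both the constant in front of μ(A ∩ R) and the bound on ε
required in the proof of Claim 42 completes the proof. ■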


Lemma 45 (Colouring Lemma). Let X and Y be non-empty finite sets, and let $c : X \times Y \to
\{0,1\}$ be a colouring of X × Y such that a proportion p ∈ (0, 1) of the elements of X × Y
are mapped to 1, that is, such that $|c^{-1}(1)|/|X \times Y| = p$. Then for any r ∈ (0, p) such that
r|Y| ∈ N, we have that for at least $\frac{p-r}{1-r}|X|$ elements x ∈ X, $|(\{x\} \times Y) \cap c^{-1}(1)| > r|Y|$.

Proof. We call sets of the form {x} × Y rows, and let the number $w(x) = \sum_{y \in Y} c(x,y) =
|(\{x\} \times Y) \cap c^{-1}(1)|$ be the weight of the row {x} × Y, for each x ∈ X. Let c be a colouring of
X × Y as above, but such that the smallest possible proportion of rows have weight > r|Y|,
and denote this proportion by q. Thus q is such that for any colouring c' satisfying the
conditions of the lemma, at least q|X| elements x ∈ X satisfy $|(\{x\} \times Y) \cap c'^{-1}(1)| > r|Y|$.

We may assume that all rows with weight ≤ r|Y| have weight exactly r|Y|: If this is
not the case, we may repeatedly perform the operation of changing a 0 into 1 on a row with
weight < r|Y|, and a 1 into 0 on a row with weight > r|Y|, until the above statement is true.
(It is easy to see that the colouring c must have rows with weight > r|Y|, since otherwise the
overall proportion of elements mapped to 1 would be ≤ r < p.) This operation leaves the
proportion of elements that are mapped to 1 unchanged, and the minimality of the chosen
colouring c guarantees that the number of rows with weight > r|Y| does not decrease (and
therefore remains unchanged).

Next, we may assume that all but at most one of the rows with weight > r|Y| have weight
exactly |Y|: If this is not the case, we may fix one such row, replace all zeroes with ones on
all other rows of weight > r|Y| (thus making their weight exactly |Y|), and on the fixed row
change the same number of ones into zeroes so as to match the changes made on all other
rows. Again the overall proportion of elements being mapped to 1 does not change, and the
minimality of the colouring c guarantees that the weight of the fixed row stays > r|Y|.

Based on the above we now have: $p|X||Y| = q|X||Y| - a|Y| + (1 - q)|X|r|Y|$, where
a ∈ [0, 1 − r) is the proportion of zeroes on the one row that has weight > r|Y| but not
necessarily = |Y|. Thus we have:

$$p \le q + (1 - q)r \implies p \le (1 - r)q + r \implies \frac{p - r}{1 - r} \le q.$$
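
The statement of the Colouring Lemma can be sanity-checked by brute force on small random colourings. The sketch below is an illustration only: the helper names are hypothetical, colourings are represented as lists of 0/1 rows, and exact rational arithmetic is used to avoid rounding issues.

```python
import random
from fractions import Fraction


def heavy_row_fraction(colouring, r):
    """Fraction of rows whose number of 1-entries is strictly larger than r * |Y|."""
    width = len(colouring[0])
    heavy = sum(1 for row in colouring if sum(row) > r * width)
    return Fraction(heavy, len(colouring))


def lemma_bound(colouring, r):
    """The quantity (p - r) / (1 - r), where p is the overall proportion of ones."""
    ones = sum(sum(row) for row in colouring)
    p = Fraction(ones, len(colouring) * len(colouring[0]))
    return (p - r) / (1 - r)


def random_colouring(rows, cols, rng):
    """A uniformly random 0/1 colouring with the given shape."""
    return [[rng.randint(0, 1) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(1)
    c = random_colouring(5, 4, rng)
    r = Fraction(1, 2)
    print(heavy_row_fraction(c, r), ">=", lemma_bound(c, r))
```

When p ≤ r the bound is non-positive and the inequality is trivial; the interesting instances are those with p close to 1, as in the application with p ≥ 24/25 and r = 21/25.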